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Abstract 

We propose rectified factor networks (RFNs) to efficiently construct very sparse, 
non-linear, high-dimensional representations of the input. RFN models identify 
rare and small events in the input, have a low interference between code units, 
have a small reconstruction error, and explain the data covariance structure. RFN 
learning is a generalized alternating minimization algorithm derived from the pos¬ 
terior regularization method which enforces non-negative and normalized poste¬ 
rior means. We proof convergence and correctness of the RFN learning algorithm. 
On benchmarks, RFNs are compared to other unsupervised methods like autoen¬ 
coders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse 
coding methods, RFNs yield sparser codes, capture the data’s covariance struc¬ 
ture more precisely, and have a significantly smaller reconstruction error. We 
test RFNs as pretraining technique for deep networks on different vision datasets, 
where RFNs were superior to RBMs and autoencoders. On gene expression data 
from two pharmaceutical drug discovery studies, RFNs detected small and rare 
gene modules that revealed highly relevant new biological insights which were so 
far missed by other unsupervised methods. 


1 Introduction 

The success of deep learning is to a large part based on advanced and efficient input representations 
Eiaiiiiii. These representations are sparse and hierarchical. Sparse representations of the input 
are in general obtained by rectified linear units (ReLU) ll5]|6l and dropout ITJ. The key advantage of 
sparse representations is that dependencies between coding units are easy to model and to interpret. 
Most importantly, distinct concepts are much less likely to interfere in sparse representations. Using 
sparse representations, similarities of samples often break down to co-occurrences of features in 
these samples. In bioinformatics sparse codes excelled in biclustering of gene expression data 111 
and in finding DNA sharing patterns between humans and Neanderthals El- 

Representations learned by ReLUs are not only sparse but also non-negative. Non-negative repre¬ 
sentations do not code the degree of absence of events or objects in the input. As the vast majority of 
events is supposed to be absent, to code for their degree of absence would introduce a high level of 
random fluctuations. We also aim for non-linear input representations to stack models for construct¬ 
ing hierarchical representations. Finally, the representations are supposed to have a large number 
of coding units to allow coding of rare and small events in the input. Rare events are only observed 
in few samples like seldom side effects in drug design, rare genotypes in genetics, or small customer 
groups in e-commerce. Small events affect only few input components like pathways with few genes 
in biology, few relevant mutations in oncology, or a pattern of few products in e-commerce. In sum¬ 
mary, our goal is to construct input representations that (1) are sparse, (2) are non-negative, (3) are 
non-linear, (4) use many code units, and (5) model structures in the input data (see next paragraph). 

Current unsupervised deep learning approaches like autoencoders or restricted Boltzmann machines 
(RBMs) do not model specific structures in the data. On the other hand, generative models explain 
structures in the data but their codes cannot be enforced to be sparse and non-negative. The input 
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representation of a generative model is its posterior’s mean, median, or mode, which depends on the 
data. Therefore sparseness and non-negativity cannot be guaranteed independent of the data. For 
example, generative models with rectified priors, like rectified factor analysis, have zero posterior 
probability for negative values, therefore their means are positive and not sparse nniin]. Sparse 
priors do not guarantee sparse posteriors as seen in the experiments with factor analysis with Lapla- 
cian and Jeffrey’s prior on the factors (see T ab. [T) . To address the data dependence of the code, we 
employ the posterior regularization method 1121 . This method separates model characteristics from 
data dependent characteristics that are enforced by constraints on the model’s posterior. 

We aim at representations that are feasible for many code units and massive datasets, therefore 
the computational complexity of generating a code is essential in our approach. For non-Gaussian 
priors, the computation of the posterior mean of a new input requires either to numerically solve 
an integral or to iteratively update variational parameters m- In contrast, for Gaussian priors the 
posterior mean is the product between the input and a matrix that is independent of the input. Still 
the posterior regularization method leads to a quadratic (in the number of coding units) constrained 
optimization problem in each E-step (see Eq. Q below). To speed up computation, we do not solve 
the quadratic problem but perform a gradient step. To allow for stochastic gradients and fast GPU 
implementations, also the M-step is a gradient step. These E-step and M-step modifications of the 
posterior regularization method result in a generalized alternating minimization (GAM) algorithm 
II3. We will show that the GAM algorithm used for REN learning (i) converges and (ii) is correct. 
Correctness means that the REN codes are non-negative, sparse, have a low reconstruction error, and 
explain the covariance structure of the data. 


2 Rectified Factor Network 


Our goal is to construct representations of the input that (1) are sparse, (2) are non-negative, (3) are 
non-linear, (4) use many code units, and (5) model structures in the input. Structures in the input 
are identified by a generative model, where the model assumptions determine which input structures 
to explain by the model. We want to model the covariance structure of the input, therefore we 
choose maximum likelihood factor analysis as model. The constraints on the input representation 
are enforced by the posterior regularization method na. Non-negative constraints lead to sparse 
and non-linear codes, while normalization constraints scale the signal part of each hidden (code) 
unit. Normalizing constraints avoid that generative models explain away rare and small signals by 
noise. Explaining away becomes a serious problem for models with many coding units since their 
capacities are not utilized. Normalizing ensures that all hidden units are used but at the cost of coding 
also random and spurious signals. Spurious and true signals must be separated in a subsequent step 
either by supervised techniques, by evaluating coding units via additional data, or by domain experts. 

A generative model with hidden units h and data v is defined by its prior p{h) and its likelihood 
p{v I h). The full model distribution p{h,v) — p{v \ h)p{h) can be expressed by the model’s 
posterior p(h. | v) and its evidence (marginal likelihood) p{v): p{h,v) = p{h \ v)p(v). The 
representation of input v is the posterior’s mean, median, or mode. The posterior regularization 
method introduces a variational distribution Q{h \ v) G Q from a family Q, which approximates 
the posterior p{h \ v). We choose Q to constrain the posterior means to be non-negative and 
normalized. The full model distribution p{h,v) contains all model assumptions and, thereby, defines 
which structures of the data are modeled. Q{h \ v) contains data dependent constraints on the 
posterior, therefore on the code. 

Eor data {r;} = {rti,..., «„}, the posterior regularization method maximizes the objective T ifT^ : 
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where Ukl is the Kullback-Leibler distance. Maximizing IF achieves two goals simultaneously: (1) 
extracting desired structures and information from the data as imposed by the generative model and 
(2) ensuring desired code properties via Q G Q. 
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The factor analysis model v = Wh + e extracts the covari¬ 
ance structure of the data. The prior h ~ Af {0,1) of the 
hidden units (factors) h G M} and the noise e ^ A/^(0,'5') 
of visible units (observations) v G ffi™ are independent. The 
model parameters are the weight (loading) matrix W G 
and the noise covariance matrix G We assume di¬ 

agonal to explain correlations between input components by 
the hidden units and not by correlated noise. The factor analy¬ 
sis model is depicted in Fig. [T] Given the mean-centered data 
{r;} = {rti,..., t)„}, the posterior| v^) is Gaussian with 
mean vector (fJ-p)i and covariance matrix 

(/xp), = (/ + V, , 

Sp = (I + . (2) 

A rectified factor network (RFN) consists of a single or stacked factor analysis model(s) with con¬ 
straints on the posterior. To incorporate the posterior constraints into the factor analysis model, 
we use the posterior regularization method that maximizes the objective iF given in Eq. 0 iia. 
Like the expectation-maximization (EM) algorithm, the posterior regularization method alternates 
between an E-step and an M-step. Minimizing the first Z?kl of Eq. 0 with respect to Q leads to a 
constrained optimization problem. Eor Gaussian distributions, the solution with {fJ,p)i andSp from 
Eq. 0 is Q{hi \ vf) ^ N {fii, S) with S = Sp and the quadratic problem: 

^ n 1 ^ 

min - ^(/x* - (/Xp)i)^ (/x, - (/Xp),) , s.t. V, : /x, > 0 , ^ = 1 , (3) 

f-t’i 7Z 77 

2=1 2=1 

where “>” is component-wise. This is a constraint non-convex quadratic optimization problem in 
the number of hidden units which is too complex to be solved in each EM iteration. Therefore, we 
perform a step of the gradient projection algorithm mna, which performs first a gradient step 
and then projects the result to the feasible set. We start by a step of the projected Newton method, 
then we try the gradient projection algorithm, thereafter the scaled gradient projection algorithm 
with reduced matrix ifT^ (see also ifTSll l. If these methods fail to decrease the objective in Eq. 0, 
we use the generalized reduced method El. It solves each equality constraint for one variable and 
inserts it into the objective while ensuring convex constraints. Alternatively, we use Rosen’s gradient 
projection method IITSi or its improvement ifT^ . These methods guarantee a decrease of the E-step 
objective. 

Since the projection P by Eq. 0 is very fast, the projected Newton and projected gradient up¬ 
date is very fast, too. A projected Newton step requires 0{nl) steps (see Eq. 0 and P defined 
in Theorem 0, a projected gradient step requires 0{mhi{nlm, nP}) steps, and a scaled gradient 
projection step requires 0{nP) steps. The RFN complexity per iteration is 0{n{im? -f P)) (see 
Alg. 0;. In contrast, a quadratic program solver typically requires for the {nl) variables (the means 
of the hidden units for all samples) 0{n^l^) steps to find the minimum lEOl . We exemplify these 
values on our benchmark datasets MNIST (n = 50k, I = 1024, m = 784) and CIFAR (n = 50k, 
I = 2048, m = 1024). The speedup with projected Newton or projected gradient in contrast to 
a quadratic solver is 0{iiAP) = 0{nH^)/0{nP), which gives speedup ratios of 1.3 • 10^° for 
MNIST and 5.2 • 10^° for CIFAR. These speedup ratios show that efficient E-step updates are 
essential for REN learning. Eurthermore, on our computers, RAM restrictions limited quadratic 
program solvers to problems with nl < 20k. 

The M-step decreases the expected reconstruction error 

£ = -/ Q{h, I V,) log {p{vi I hi)) dhi (4) 

= \{m log(27r) -f log I’®'! -f Tr - 2 Tr + Tr IVES') ) . 

from Eq. 0 with respect to the model parameters W and ®'. Definitions of C, U and S are 
given in Alg. [T] The M-step performs a gradient step in the Newton direction, since we want to 



Figure 1: Factor analysis model: 
hidden units (factors) h, visible 
units V, weight matrix W, noise e. 
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Algorithm 1 Rectified Factor Network. 


C l T 

— n = l 

while STOP=false do 

-E-stepl- 

for all 1 < i < n do 

(/Xp)i = {I + w'^<b-^wy^ 

end for 

S = = (/ + 


-Constraint Posterior- 


(1) projected Newton, (2) projected gradient, 

(3) scaled gradient projection, (4) generalized 
reduced method, (5) Rosen’s gradient project. 

Complexity: objective T: 0(min{nlm,nP} + Z^); E-stepl: 
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■ 1 Vi fJji 

1 fii /r-i + S 


-E-step2- 

u ^kT.1 
s-lTt 

-M-step- 

E = C - U W'^ - WU + W SW'^ 
W = W + r) [U S-'- - W) 
for all 1 < fc < m do 

5'fcfc = iPfefe + rj {Ekk — ^kk) 

end for 

if stopping criterion is met: STOP=true 

end while 

2/.™ ( 1\ 12/ 


0(min{m‘^(m -|- l),r{m + Z)} -|- nlm)'. 


projected Newton: 0(nZ); projected gradient: C)(min{nZm, nZ'^}); scaled gradient projection: 0(nr 


step2: 0(nZ(m-|-Z)); M-step: 0(mZ(m-|-Z)); overall complexity with projectedNewton/gradientfor (Z+m) < 
n: 0{n{m^ + Z^)). 


allow stochastic gradients, fast GPU implementation, and dropout regularization. The Newton step 
is derived in the supplementary which gives further details, too. Also in the E-step, REN learning 
performs a gradient step using projected Newton or gradient projection methods. These projection 
methods require the Euclidean projection P of the posterior means {{Hp)i} onto the non-convex 
feasible set: 


min 

Mi 


1 

n 


2=1 






S.t. 


ft* > 0 , 



Z = 1 


1. (5) 


The following Theoremgives the Euclidean projection P as solution to Eq. (|^. 

Theorem 1 (Euclidean Projection). If at least one is positive for I < j <1, then the solution 

to optimization problem Eq. © is 


b'ij ~ [P((A*p)i)]j 



0 for {p.p)ij < 0 

(.Itp)ij for ^ 0 


( 6 ) 


If all {p,p)ij are non-positive for 1 < j < (, then the optimization problem Eq. © has the solution 
Pij = y/n for j = arg max^K/Xpj^j} and p,ij = 0 otherwise. 


Proof. See supplementary material. □ 

Using the projection P defined in Eq. the E-step updates for the posterior means pi are: 

= V{py + 7(d - , d = V{py + AJf-1 ^f\{Pp)i - pf^)) (7) 

where we set for the projected Newton method H~^ — Sp (thus = I), and for the 

projected gradient method H~^ = I. Eor the scaled gradient projection algorithm with reduced 
matrix, the e-active set for i consists of all j with pij < e. The reduced matrix H is the Hessian 
Sp ^ with e-active columns and rows j fixed to unit vectors ej. The resulting algorithm is a posterior 
regularization method with a gradient based E- and M-step, leading to a generalized alternating 
minimization (GAM) algorithm ED. The REN learning algorithm is given in Alg. [T] Dropout 
regularization can be included before E-step2 by randomly setting code units pij to zero with a 
predefined dropout rate (note that convergence results will no longer hold). 


3 Convergence and Correctness of RFN Learning 

Convergence of RFN Learning. Theorem|^states that Alg. [T] converges to a maximum of E. 

Theorem 2 (REN Convergence). The rectified factor network (RFN) learning algorithm given in 
AZg.0 is a "generalized alternating minimization ” (GAM) algorithm and converges to a solution 
that maximizes the objective T. 
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Proof. We present a sketch of the proof which is given in detail in the supplement. For convergence, 
we show that Alg.[T]is a GAM algorithm which convergences according to Proposition 5 in ETIl . 

Alg .[^ensures to decrease the M-step objective which is convex in W and The update with 

77 = 1 leads to the minimum of the objective. Convexity of the objective guarantees a decrease in the 
M-step for 0 < p < 1 if not in a minimum. Alg. [^ensures to decrease the E-step objective by using 
gradient projection methods. All other requirements for GAM convergence are also fulfilled. □ 

Proposition 5 in ll2Tll is based on ZangwilPs generalized convergence theorem, thus updates of the 
REN algorithm are viewed as point-to-set mappings Ell . Therefore the numerical precision, the 
choice of the methods in the E-step, and GPU implementations are covered by the proof. 


Correctness of RFN Learning. The goal of the REN algorithm is to explain the data and its 
covariance structure. The expected approximation error E is defined in line 14 of Alg.[T] Theorem]^ 
states that the REN algorithm is correct, that is, it explains the data (low reconstruction error) and 
captures the covariance structure as good as possible. 

Theorem 3 (REN Correctness). The fixed point W of Alg. ^minimizes Tr (’®') given pi and S by 
ridge regression with 


Tr(’®') 


1 

n 


11GII2 + 

i=l 


W 


2 

F 


( 8 ) 


where — Wp,. The model explains the data covariance matrix by 

c = ^ + w SW^ 


(9) 


up to an error, which is quadratic in ^ for ^ ^ WW"^. The reconstruction error ^ 11^*112 

is quadratic in \I/ for ’S' < WW'^. 


Proof. The fixed point equation for the TU update is A VE = US ^ — W = 0 ^ W = US 
Using the definition of (7 and S', we have W = I’i ) (^ Mi + ^) 

is the ridge regression solution of 


- + EESI/2 = Tr - + VESEE^ 


( 10 ) 


where Tr is the trace. After multiplying out all in l/nX)r=i we obtain; 


E = 


e,; eT 


WSEE^. 


( 11 ) 


Eor the fixed point of ’S', the update rule gives: diag (’S') = diag + WSEE^). 

Thus, VE minimizes Tr(’S') given pi and S. Multiplying the Woodbury identity for 
(EEW^ -t- ’S') from left and right by ^ gives 

EESEE^ = ^ ^ (EE EE^-f’S')^^ ’S'. (12) 

Inserting this into the expression for diag (’S') and taking the trace gives 

=Tr(^’S'(EEEE'^-f’S')”^’S') < Tr -f Tr (’S')^ . (13) 

Therefore for ’S' ^ EEEE^ the error is quadratic in ’S'. EEU^ = EESEE^ = UEE^ follows from 
fixed point equation U = WS. Using this and Eq. ( [T^ , Eq.([TT) is 

1 ^ 

- ^ e, ef - S' (EE EE^ -f ’S')”^ ’S' = C - S' - EE S EE"^ . (14) 

2 = 1 

Using the trace norm (nuclear norm or Ky-Ean n-norm) on matrices, Eq. ([T3) states that the left 
hand side of Eq. ( [T^ is quadratic in ’S' for S' ^ EEEE^. The trace norm of a positive semi-definite 
matrix is its trace and bounds the Erobenius norm ll23l . Thus, for ’S' ^ EEEE^, the covariance is 
approximated up to a quadratic error in ’S' according to Eq. (|^. The diagonal is exactly modeled. □ 
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Since the minimization of the expected reconstruction error Tr (\I/) is based on the quality of 
reconstruction depends on the correlation between and Vi. We ensure maximal information in /x^ 
on Vi by the I-projection (the minimal Kullback-Leibler distance) of the posterior onto the family of 
rectified and normalized Gaussian distributions. 


4 Experiments 

RFNs vs. Other Unsupervised Methods. We assess the performance of rectified factor networks 
(RFNs) as unsupervised methods for data representation. We compare (1) RFN: rectified factor net¬ 
works, (2) RFNn; RFNs without normalization, (3) DAE: denoising autoencoders with ReLUs, (4) 
RBM: restricted Boltzmann machines with Gaussian visible units, (5) FAsp: factor analysis with 
Jeffrey’s prior {p{z) oc 1/z) on the hidden units which is sparser than a Laplace prior, (6) FAlap: 
factor analysis with Laplace prior on the hidden units, (7) ICA: independent component analysis 
by FastICA ll24ll . (8) SFA: sparse factor analysis with a Laplace prior on the parameters, (9) FA: 
standard factor analysis, (10) PCA: principal component analysis. The number of components are 
fixed to 50, 100 and 150 for each method. We generated nine different benchmark datasets (D1 to 
D9), where each dataset consists of 100 instances. Each instance has 100 samples and 100 features 
resulting in a lOOx 100 matrix. Into these matrices, biclusters are implanted [81. A bicluster is a 
pattern of particular features which is found in particular samples like a pathway activated in some 
samples. An optimal representation will only code the biclusters that are present in a sample. The 
datasets have different noise levels and different bicluster sizes. Large biclusters have 20-30 sam¬ 
ples and 20-30 features, while small biclusters 3-8 samples and 3-8 features. The pattern’s signal 
strength in a particular sample was randomly chosen according to the Gaussian A/^ (1,1). Finally, 
to each matrix, zero-mean Gaussian background noise was added with standard deviation 1, 5, or 
10. The datasets are characterized by Dx=((t, ni, 7 x 2 ) with background noise cr, number of large 
biclusters rxi, and the number of small biclusters 7 x 2 : Dl=(l,10,10), D2=(5,10,10), D3=(l0,10,10), 
D4=(l,15,5), D5=(5,15,5), D6=(10,15,5), D7=(l,5,15), D8=(5,5,15), D9=(10,5,15). 

We evaluated the methods according to the (1) sparseness of the components, the (2) input recon¬ 
struction error from the code, and the (3) covariance reconstruction error for generative models. 
For RFNs sparseness is the percentage of the components that are exactly 0, while for others meth¬ 
ods it is the percentage of components with an absolute value smaller than 0.01. The reconstruction 
error is the sum of the squared errors across samples. The covariance reconstruction error is the 
Frobenius norm of the difference between model and data covariance. See supplement for more 
details on the data and for information on hyperparameter selection for the different methods. Tab.[T] 
gives averaged results for models with 50 (undercomplete), 100 (complete) and 150 (overcomplete) 
coding units. Results are the mean of 900 instances consisting of 100 instances for each dataset D1 
to D9. In the supplement, we separately tabulate the results for D1 to D9 and confirm them with dif¬ 
ferent noise levels. FAlap did not yield sparse codes since the variational parameter did not push the 

Table 1: Comparison of RFN with other unsupervised methods, where the upper part contains meth¬ 
ods that yielded sparse codes. Criteria: sparseness of the code (SP), reconstruction error (ER), 
difference between data and model covariance (CO). The panels give the results for models with 50, 
100 and 150 coding units. Results are the mean of 900 instances, 100 instances for each dataset D1 
to D9 (maximal value: 999). RENs had the sparsest code, the lowest reconstruction error, and the 
lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). 


undercomplete 50 code units complete 100 code units overcomplete 150 code units 



SP 

ER 

CO 

SP 

ER 

CO 

SP 

ER 

CO 

REN 

75±o 

249±3 

I08±3 

81±i 

68±9 

26±6 

85±i 

17±6 

7±6 

RENn 

74±o 

295±4 

I40±4 

79±o 

185±5 

59±3 

80±o 

142±4 

35±2 

DAE 

66±o 

251 ±3 

— 

69±o 

147±2 

— 

7I±o 

130±2 

— 

RBM 

15±i 

3I0±4 

— 

7±i 

287±4 

— 

5±o 

286±4 

— 

EAsp 

40±i 

999±63 

999 ±99 

63±o 

999±65 

999±99 

80±o 

999±65 

999±99 

FAlap 

4±o 

239±6 

341±i9 

6±o 

46 ±4 

985±45 

4±o 

46 ±4 

976±53 

ICA 

2±o 

I74±2 

— 

3±i 

0±o 

— 

3±i 

0±o 

— 

SFA 

1±0 

218±5 

94±3 

1±0 

16±i 

114±5 

1±0 

16±i 

285 ±7 

FA 

1±0 

218±4 

90±3 

1±0 

16±i 

83±4 

1±0 

16±i 

263 ±6 

PCA 

0±o 

I74±2 

— 

2±o 

0±o 

— 

2±o 

0±o 

— 
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(a) MNIST digits 
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(b) MNIST digits with random image background 


(c) MNIST digits with random noise background 


(d) convex and concave shapes 




(e) tall and wide rectangular (f) rectangular images on background images 



(g) CIFAR-10 images (best viewed in color) (h) NORB images 

Figure 2: Randomly selected filters trained on image datasets using an RFN with 1024 hidden units. 
RFNs learned stroke, local and global blob detectors. RFNs are robust to background noise (b,c,f). 


absolute representations below the threshold of 0.01. The variational approximation to the Lapla- 
cian is a Gaussian distribution 113. RFNs had the sparsest code, the lowest reconstruction error, 
and the lowest covariance approximation error of all methods that yielded sparse representations 
(SP>10%). 


RFN Pretraining for Deep Nets. We assess the performance of rectified factor networks (RFNs) 
if used for pretraining of deep networks. Stacked RFNs are obtained by first training a single layer 
RFN and then passing on the resulting representation as input for training the next RFN. The deep 
network architectures use a RFN pretrained first layer (RFN-1) or stacks of 3 RFNs giving a 3- 
hidden layer network. The classification performance of deep networks with RFN pretrained layers 
was compared to (i) support vector machines, (ii) deep networks pretrained by stacking denoising 
autoencoders (SDAE), (iii) stacking regular autoencoders (SAE), (iv) restricted Boltzmann machines 
(RBM), and (v) stacking restricted Boltzmann machines (DBN). 

The benchmark datasets and results are taken from previous publications 1^ l26\ l27l |28l and con¬ 
tain: (i) MNIST (original MNIST), (ii) basic (a smaller subset of MNIST for training), (iii) bg-rand 
(MNIST with random noise background), (iv) bg-img (MNIST with random image background), (v) 
rect (discrimination between tall and wide rectangles), (vi) rect-img (discrimination between tall and 
wide rectangular images overlayed on random background images), (vii) convex (discrimination be¬ 
tween convex and concave shapes), (viii) CIFAR-10 (60k color images in 10 classes), and (ix) NORB 
(29,160 stereo image pairs of 5 generic categories). Eor each dataset its size of training, validation 
and test set is given in the second column of Tab. As preprocessing we only performed median 
centering. Model selection is based on the validation set performance E6\ . The RENs hyperparam¬ 
eters are (i) the number of units per layer from {1024, 2048,4096} and (ii) the dropout rate from 
{0.0, 0.25,0.5,0.75}. The learning rate was fixed to its default value of p = 0.01. Eor supervised 
fine-tuning with stochastic gradient descent, we selected the learning rate from {0.1, 0.01, 0.001}, 

Table 2: Results of deep networks pretrained by RENs and other models (taken from ll^ 1^ IZTl 
l28l ). The test error rate is reported together with the 95% confidence interval. The best performing 
method is given in bold, as well as those for which confidence intervals overlap. The first column 
gives the dataset, the second the size of training, validation and test set, the last column indicates 
the number of hidden layers of the selected deep network. In only one case REN pretraining was 
significantly worse than the best method but still the second best. In six out of the nine experiments 
REN pretraining performed best, where in four cases it was significantly the best. 


Dataset 


SVM 

RBM 

DBN 

SAE 

SDAE 

REN 

MNIST 

50k-10k-10k 

1.40±0.23 

1.21±0.21 

1.24±0.22 

1.40±0.23 

1.28±0.22 

1.27±0.22 (1) 

basic 

10k-2k-50k 

3.03±o.i5 

3.94±o.i7 

3. 11 ±0.15 

3.46±o.i6 

2.84±o.is 

2.66±o.i4 (1) 

bg-rand 

10k-2k-50k 

14.58±o.3i 

9.80±0.26 

6.73 ±0.22 

11.28±0.28 

10.30±0.27 

7.94±o.24(3) 

bg-img 

10k-2k-50k 

22.61 ±0.37 

16.15±0.32 

16.31±0.32 

23.00±0.37 

16.68±o.33 

15.66±o.32 (1) 

rect 

lk-0,2k-50k 

2.15±o.i3 

4.71 ±0.19 

2.60±o.i4 

2.41±o.i3 

1.99±o.i2 

0.63±o.o6 (1) 

rect-img 

10k-2k-50k 

24.04±o.37 

23.69±o.37 

22.50±o.37 

24.05 ±0.37 

21.59±o.36 

20.77±o.36 (1) 

convex 

10k-2k-50k 

19.13±0.34 

19.92±o.35 

18.63±o.34 

18.41±0.34 

19.06±0.34 

16.41±0.32 (1) 

NORB 

19k-5k-24k 

b^ 

O 

o 

8.31 ±0.35 

- 

10.10±0.38 

9.50±0.37 

7.00±0.32 (1) 

CIEAR 

40k-10k-10k 

62. 7 ±0.95 

40.39±o.96 

43.38±o.97 

43.25±o.97 

- 

41.29±o.95 (1) 
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Figure 3: Examples of small and rare events identified by REN in two drug design studies, which 
were missed by previous methods. Panel A and B; first row gives the coding unit, while the other 
rows display expression values of genes for controls (red), active drugs (green), and inactive drugs 
(black). Drugs (green) in panel A strongly downregulate the expression of tubulin genes which 
hints at a genotoxic effect by the formation of micronuclei (C). The micronuclei were confirmed by 
microscopic analysis (D). Drugs (green) in panel B show a transcriptional effect on genes with a 
negative feedback to the MAPK signaling pathway (E) and therefore are potential cancer drugs. 

the masking noise from {0.0,0.25}, and the number of layers from {1,3}. Eine-tuning was stopped 
based on the validation set performance, following ESh . The test error rates together with the 95% 
confidence interval (computed according to l26l) for deep network pretraining by RENs and other 
methods are given in Tab. Eig.|^ shows learned filters. The result of the best performing method 
is given in bold, as well as the result of those methods for which confidence intervals overlap. RENs 
were only once significantly worse than the best method but still the second best. In six out of the 
nine experiments RENs performed best, where in four cases it was significantly the best. 

RFNs in Drug Discovery. Using RENs we analyzed gene expression datasets of two projects in 
the lead optimization phase of a big pharmaceutical company E9\ . The first project aimed at finding 
novel antipsychotics that target PDEIOA. The second project was an oncology study that focused 
on compounds inhibiting the EGE receptor. In both projects, the expression data was summarized 
by EARMS ll^ and standardized. RENs were trained with 500 hidden units, no masking noise, and 
a learning rate of p = 0.01. The identified transcriptional modules are shown in Eig. Panels A 
and B illustrate that RENs found rare and small events in the input. In panel A only a few drugs are 
genotoxic (rare event) by downregulating the expression of a small number of tubulin genes (small 
event). The genotoxic effect stems from the formation of micronuclei (panel C and D) since the 
mitotic spindle apparatus is impaired. Also in panel B, REN identified a rare and small event which 
is a transcriptional module that has a negative feedback to the MAPK signaling pathway. Rare events 
are unexpectedly inactive drugs (black dots), which do not inhibit the PGP receptor. Both findings 
were not detected by other unsupervised methods, while they were highly relevant and supported 
decision-making in both projects ll29ll . 

5 Conclusion 

We have introduced rectified factor networks (RENs) for constructing very sparse and non-linear 
input representations with many coding units in a generative framework. Like factor analysis, REN 
learning explains the data variance by its model parameters. The REN learning algorithm is a poste¬ 
rior regularization method which enforces non-negative and normalized posterior means. We have 
shown that REN learning is a generalized alternating minimization method which can be proved 
to converge and to be correct. RENs had the sparsest code, the lowest reconstruction error, and the 
lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). 
RENs have shown that they improve performance if used for pretraining of deep networks. In two 
pharmaceutical drug discovery studies, RENs detected small and rare gene modules that were so far 
missed by other unsupervised methods. These gene modules were highly relevant and supported 
the decision-making in both studies. RENs are geared to large datasets, sparse coding, and many 
representational units, therefore they have high potential as unsupervised deep learning techniques. 

Acknowledgment. The Tesla K40 used for this research was donated by the NVIDIA Corporation. 
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SI Introduction 


This supplement contains additional information complementing the main manuscript and is struc¬ 
tured as follows: First, the rectified factor network (RFN) learning algorithm with E- and M-step 
updates, weight decay and dropout regularization is given in Section S2 In Section we proof 
that the (RFN) learning algorithm is a “generalized alternating minimization” (GAM) algorithm and 
converges to a solution that maximizes the RFN objective. The correctness of the RFN algorithm is 
proofed in Section]^ Section [^describes the maximum likelihood factor analysis model and the 
model selection by the EM-algorithm. The RFN objective, which has to be maximized, is described 
in Section]^ Next, RFN’s GAM algorithm via gradient descent both in the M-step and the E-step 
is reported in the Section]^ The following sections [S8] and [S9|describe the gradient-based M- and 
E-step, respectively. In Section [ST0| we describe how the RFNs sparseness can be controlled by a 
Gaussian prior. Additional information on the selected hyperparameters of the benchmark methods 
is given in Section [STT] The sections S12 and ST^ describe the data generation of the benchmark 
datasets and report the results for three different experimental settings, namely for extracting 50 


(undercomplete), 100 (complete) or 150 (overcomplete) factors / hidden units. Finally, Section S14 


describes experiments, that we have done to assess the performance of RFN^rsf layer pretraining 
on CIFAR-10 and CIFAR-100 for three deep convolutional network architectures: (i) the AlexNet 
ED [321, (ii) Deeply Supervised Networks (DSN) ES), and (iii) our 5-Convolution-Network-In- 
Network (5C-NIN). 


S2 Rectified Factor Network (RFN) Algorithms 


Algorithm 

Algorithm 


S2 


S3 


is the rectified factor network (RFN) learning algorithm. The RFN algorithm calls 
to project the posterior probability pi onto the family of rectified and normalized 


variational distributions Qi. Algorithm [^guarantees an improvement of the E-step objective O = 
DxhiQi II Pi)- Projection Algorithm |S3| relies on different projections, where a more 
complicated projection is tried if a simpler one failea to improve the E-step objective. If all following 
Newton-based gradient projection methods fail to decrease the E-step objective, then projection 
Algorithm[^falls back to gradient projection methods. First the equality constraints are solved and 
inserted into the objective. Thereafter, the constraints are convex and gradient projection methods 
are applied. This approach is called “generalized reduced gradient method” lEl, which is our 
preferred alternative method. If this method fails, then Rosen’s gradient projection method ifTSll is 
used. Finally, the method of Haug and Arora Gl is used. 

First we consider Newton-based projection methods, which are used by Algorithm|S3| Algorithm|S5| 
performs a simple projection, which is the projected Newton method with learning rate set to one. 
This projection is very fast and ideally suited to be performed on GPUs for RFNs with many coding 
units. Algorithm[^is the fast and simple projection without normalization even simpler than Algo- 
rithm[^ Algorithm [^generalizes Algorithm[^by introducing step sizes A and 7. The step size 
A scales the gradient step, while 7 scales the difference between to old projection and the new pro¬ 
jection. For both A and 7 annealing steps, that is, learning rate decay is used to find an appropriate 
update. 

If these Newton-based update rules do not work, then Algorithm[^is used. Algorithm |S7| performs 
a scaled projection with a reduced Hessian matrix H instead of the full Hessian S'For computing 
H an e-active set is determined, which consists of all j with pj < e. The reduced matrix H is the 
Hessian with e-active columns and rows j fixed to unit vector ej. 

The RFN algorithm allows regularization of the parameters W and tP (off-diagonal elements) by 
weight decay. Priors on the parameters can be introduced. If the priors are convex functions, then 
convergence of the RFN algorithm is still ensured. The weight decay Algorithm [^ can optionally 
be used after the M-step of Algorithm [^ Coding units can be regularized by dropout. However 
dropout is not covered by the convergence proof for the RFN algorithm. The dropout Algorithm|S9| 
is applied during the projection between rectifying and normalization. Methods like mini-batches or 
other stochastic gradient methods are not covered by the convergence proof for the RFN algorithm. 
However, in ll^ it is shown how to generalize the GAM convergence proof to mini-batches as it 
is shown for the incremental EM algorithm. Dropout and other stochastic gradient methods can be 
show to converge similar to mini-batches. 
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Algorithm S2 Rectified Factor Network 

Input 

for I < i < n: Vi G K™, 
number of coding units I 

Hyper-Parameters 

^min, VFmax, VW, p,T,l<r]<l 

Initialization 

'i' = tI, W element-wise random in [—p, p], 

C ^ i ELi«fc^i’>STOP=false 

Main 

while STOP=false do 

-E-step 1- 

for aU 1 < i < n do 

{pip), = {I + Vi 

end for 
S = (J + 

-Projection- 

perform projection of {pip)i onto the feasible set by Algorithm [^giving pii 

-E-step2- 

u = ^^yy,v,py 
s = ^ + s 

-M-step- 

Pw = m = p 

E = C- UW^-WU + WSW^ 

- W update - 

w = W + pw {u S-^ - W) 

- diagonal ’S' update - 

for all 1 < fc < m do 

'pfcfe = '^kk + p^ {Ekk — 'Pfefc) 

end for 

—full ’S' update - 

’S' = ’S' + - ’S') 

- bound parameters - 

W = median!-VFmax , W , PPmax} 

’S' = median! S'min , ’®' , max!C}} 

if stopping criterion is met: STOP=true 

end while 
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Algorithm S3 Projection with E-Step Improvement 

Goal 

obtain = fii that decrease the E-step objective 

Input 

^new _ ^ “ y]old 

for 1 < * < n: (/Tp)*, = U{{pip)i, Sp) 

simple projection P (rectified or rectified & normalized). 

E-step objective; 0 = ^ DKi.{Qi || Pi) 

Tniini '^niini P75 P\^ ^ (for 6-aCtive Set) 

Main 

-Simple Projection- 

perform Newton Projection by Algorithm |S5| or Algorithm|S4| 

-Scaled Projection- 

if 0 < AO then 

following loop for; ( 1 ) 7 , (2) A, or (3) 7 and A annealing 
7 = A = 1 

while 0 < AO and A > Amin and 7 > 7 min do 
7 = 7 (skipped for A annealing) 

A = Pa A (skipped for 7 annealing) 

perform Scaled Newton Projection by Algorithm|S 6 | 

end while 
end if 

-Scaled Projection With Reduced Matrix- 

if 0 < AO then 

determine e-active set as all j with pj < e 
set H to Sp ^ with e-active columns and rows j fixed to 
following loop for; ( 1 ) 7 , (2) A, or (3) 7 and A annealing 
7 = A = 1 

while 0 < AO and A > Amin and 7 > 7 min do 
7 = 7 (skipped for A annealing) 

^ = Pa A (skipped for 7 annealing) 

perform Scaled Projection With Reduced Matrix by Algorithm|S7| 

end while 
end if 

-General Gradient Projection- 

while 0 < AO do 

use generalized reduced gradient ifTTl OR 
use Rosen’s gradient projection ifTSll OR 
use method of Haug and Arora nsi 

end while 


Algorithm S4 Simple Projection; Rectifying 
Goal 

for 1 < * < n; project (/Xp)i onto feasible set giving pi 

Input 

{f^p)i 

Main 

for all 1 < j < Z do 

Py = max|o, 

end for 
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Algorithm S5 Simple Projection; Rectifying and Normalization 


Goal 

for 1 < i < n: project (/Xp)i onto feasible set giving fii 

Input 

for 1 < i < n: {f^p)i 

Rectifier 

for all 1 < t < n do 
for all 1 < j < ^ do 

Aij = niax|o, 

end for 
end for 
Normalizer 

for all 1 < t < n do 

if at least one > 0 then 
for all 1 < j < ? do 

If.. — __ 

end for 
else 

for all 1 < j < Z do 

_ J ^/n for j = argmax-{[(p 

I n VlOVTJT-l CO 


end for 
end if 
end for 


j = aigmax~.{[{fip)i\~} 


otherwise 


Algorithm S6 Scaled Newton Projection 

Goal 

perform a scaled Newton step with subsequent projection 

Input 

for 1 < i < n: {f^p)i 
for I < i < n: 

simple projection P (rectified or rectified & normalized), 
A (gradient step size), 7 (projection difference) 

Main 

d = P (/Xfd + ^ 

M-'" = +l{d- 


Algorithm S7 Scaled Projection With Reduced Matrix 

Goal 

perform a scaled projection step with reduced matrix 

Input 

for 1 < i < n: {f^p)i 
for I < i < n: 

simple projection P (rectified or rectified & normalized), 

A, 7 , H, 

Main 

d = P(/xfd + AJf-i ^p\{^^p)^ - 
/r-'" = Vilify +l{d- 
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Algorithm S8 Weight Decay 

Input 

Parameters IV 

Weight decay factors (Gaussian) and 7 ^, (Laplacian) 

Gaussian 

W = W - -fcW 

Laplacian 

W = median{— 72 , , W , 7 ^} 

W = W - W 


Algorithm S9 Dropout 
Input 

for 1 < i < n: fii 
dropout probability d 

Main 

for all 1 < i < n do 
for all 1 < j < ^ do 

Pr((5 = 0) = d 

— d /Tjj 

end for 
end for 


S3 Convergence Proof for the RFN Learning Algorithm 


Theorem 4 (RFN Convergence). The rectified factor network (RFN) learning algorithm given in 
Algorithm |52| is a “generalized alternating minimization’’ (GAM) algorithm and converges to a 
solution that maximizes the objective T. 


Proof. The factor analysis EM algorithm is given by Eq. ( |8T] l and Eq. ( |8^ in Section S5 Algo- 
rithm|^is the factor analysis EM algorithm with modified the E-step and the M-step. The E-step is 
modified by constraining the variational distribution Q to non-negative means and by normalizing 
its means across the samples. The M-step is modified to a Newton direction gradient step. 

Like EM factor analysis, Algorithmaims at maximizing the negative/ree energy T, which is 


^ n ^ n 

^ = - VlogpK)- '^DKhiQihi) II p(h, I V,)) 

i=l 2=1 

= - f \ogp{vi)dh, - - X! log X dhi 


(15) 


^ ^ ' p{h^,Vi) 


2=1 ' 


= - - ^ f Q(hi) log ^7^ dhi + - f Q(hi) logp{vi \ hi) dhi 

” i=i '' ^ i=i 

-| n „ 1 

= - / Q{hi) \ogp{vi I hi) dh, -'V DKhiQihi) || p(h,)) . 


Z?kl denotes the Kullback-Leibler (KL) divergence 11341 . which is larger than or equal to zero. 

Algorithm S2 decreases ^ ^KhiQihi) || p{hi \ Vi)) (the E-step objective) in its E-step under 
constraints tor non-negative means and normalization. The constraint optimization problem from 
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Section S9.2 for the E-step is 


1 " 

min -y^DKhiQihi) || p{hi \ Vi)) 
Q{h,) n ^ 


s.t. Vi : > 0 


i=l 

The M-step of Algorithm]^ aims at decreasing 

£ = -V / Q{h,) log {p{vi I hi)) dh, . 

n 


( 16 ) 


(17) 


Algorithm [^performs one gradient descent step in the Newton direction to decrease £, while EM 
factor analysis minimizes S. 


Erom the modification of the E-step and the M-step follows that Algorithmj^is a Generalized Al¬ 
ternating Minimization (GAM) algorithm according to ED. GAM is an EM algorithm that increases 
F in the E-step and increases F in the M-step (see also Section [S7)i. The most important require¬ 
ments for the convergence of the GAM algorithm according to Theorem |7] (Proposition 5 in ED) 
are the increase of the objective F in both the E-step and the M-step. Therefore we first show these 
two decreases before showing that all requirements of convergence Theorem[7]are met. 


Algorithm [^ensures to decrease the M-step objective. The M-step objective £ is convex in W 
and according to Theorem|^and Theorem 
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The update with pw = — p = I leads to the 

The convexity of £ guarantees that each 


10 


minimum of £ according to Theorem and Theorem 
update with 0 < pw = ?7'i' = < 1 decreases the M-step objective £, except the current W and 

^-1 

are already the minimizers. 


Algorithm [^ensures to decrease the E-step objective. The E-step decrease of Algorithm [S^ is 
performed by Algorithm]^ According to Theorem[l4|the scaled projection with reduced matrix en¬ 
sures a decrease of the E-step objective for rectifying constraints (convex feasible set). According to 
Theorem[T^also gradient projection methods ensure a decrease of the E-step objective for rectifying 
constraints. Eor rectifying constraints and normalization, the feasible set is not convex because of 
the equality constraints. To optimize such problems, the generalized reduced gradient method lIlTll 
solves each equality constraint for one variable and inserts it into the objective. Eor our problem 
Eq. ( |160| l gives the solution and Eq. ( |161| l the resulting convex constraints. Now scaled projection 
and gradient projection methods can be applied. Eor rectifying and normalizing constraints, also 
Rosen’s ifTSi and Haug & Arora’s ifT^ gradient projection method ensures a decrease of the E-step 
objective since they can be applied to non-convex problems. 


We show that the requirements as given in Section[^for GAM convergence according to Theorem[7] 
(Proposition 5 in ED) are fulfilled; 


1. the learning rules, that is, the E-step and the M-step, are closed maps —> ensured by 
continuous and continuous differentiable maps, 

2. the parameter set is compact —ensured by bounding ’S' and W, 

3. the family of variational distributions is compact (often described by the feasible set of 
parameters of the variational distributions) —> ensured by continuous and continuous dif¬ 
ferentiable functions for the constraints and by the bounds on the variational parameters /r 
and S determined by bounds on the parameters and the data, 

4. the support of the density models does not depend on the parameter —ensured by Gaus¬ 
sian models with full-rank covariance matrix, 

5. the density models are continuous in the parameters —ensured by Gaussian models 

6. the E-step has a unique maximizer —> ensured by the convex, continuous, and continuous 
differentiable function that is minimized llT5ll3^ together with compact feasible set for the 
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variational parameters, the maximum may be local for non-convex feasible sets stemming 
from normalization, 

7. the E-step increases the objective if not at the maximizer —> ensured as shown above, 

8. the M-step has a unique maximizer (this is not required) —ensured by minimizing a 
convex, continuous and continuous differentiable function in the model parameter and a 
convex feasible set, the maximum is a global maximum, 

9. the M-step increases the objective if not at the maximizer —ensured as shown above. 


□ 

Since this Proposition 5 in IItTI is based on Zangwill’s generalized convergence theorem, updates 
of the REN algorithm are viewed as point-to-set mappings ll22l . Therefore the numerical precision, 
the choice of the methods in the E-step, and GPU implementations are covered by the proof. That 
the M-step has a unique maximizer is not required to proof Theorem|^by Theorem|7] However we 
obtain an alternative proof by exchanging the variational distribution Q and the parameters {W, ’4'), 
that is, exchanging the E-step and the M-step. A theorem analog to Theorem [7]but with E-step and 
M-step conditions exchanged can be derived from Zangwill’s generalized convergence theorem ll22l . 

The resulting model from the GAM procedure is at a local maximum of the objective given the model 
family and the family of variational distributions. The solution minimizes the KL-distance between 
the family of full variational distributions and full model family. “Eull” means that both the observed 
and the hidden variables are taken into account, where for the variational distributions the probability 
of the observations is set to 1. The desired family is defined as the set of all probability distributions 
that assign probability one to the observation. In our case the family of variational distributions is not 
the desired family since some distributions are excluded by the constraints. Therefore the solution 
of the GAM optimization does not guarantee stationary points in likelihood Gil. This means that 
we do not maximize the likelihood but minimize 

- T K. DKL(Q(/j-,'ft) \\p{h,v)) + c (18) 

according to Eq. ( |87l l, where c is a constant independent of Q and independent of the model param¬ 
eters. 


S4 Correctness Proofs for the RFN Learning Algorithms 


The REN algorithm is correct if it has a low reconstruction error and explains the data covariance 
matrix by its parameters like factor analysis. We show in Theoremj^and Theoremj^that the REN 
algorithm 

1. minimizes the reconstruction error given fii and S (the error is quadratic in ’i'); 

2. explains the covariance matrix by its parameters W and ’4' plus an estimate of the second 
moment of the coding units S. 


Since the minimization of the reconstruction error is based on /r^, the quality of reconstruction and 
covariance explanation depends on the correlation between and Vi. The larger the correlation 
between fii and Vi, the lower the reconstruction error and the better the explanation of the data 
covariance. We ensure maximal information in on Vi by the I-projection (the minimal Kullback- 
Leibler distance) of the posterior onto the family of rectified and normalized Gaussian distributions. 

The reconstruction error for given mean values fii is 


where 


1 

n 


i=l 


et = Vi - W fli . 


(19) 


( 20 ) 
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The reconstruction error for using the whole variational distribution Q{hi) instead of its means is 
’4'. Below we will derive Eq. ( |3T] i, which is 

^ = diag ef + . (21) 

Therefore \I/ is the reconstruction error for given mean values plus the variance WHW'^ introduced 
by the hidden variables. 


S4.1 Diagonal Noise Covariance Update 


Theorem 5 (REN Correctness: Diagonal Noise Covariance Update). The fixed point W minimizes 
Tr (4/) given pti and S by ridge regression with 


Tr(’®') 


1 

n 




w 


2 

5 

F 


( 22 ) 


where we used the error 


Ei = Vi - W Hi (23) 

The model explains the data covariance matrix by 

C ^ ^ + W SW^ (24) 

up to an error, which is quadratic in ^ for '5' <C . The reconstruction error 

n 

-E 11*^*112 (25) 

i—l 

is quadratic in for ’S' <C . 


Proof The fixed point equation for the W update is 

AW = U S-^ - W = 0 => W = U S-^ . (26) 

Using the definition of U and S, the fixed point equation Eq. ( |26l l gives 

W' = (; EfiMr + sj (27) 

Therefore W is a ridge regression estimate, also called generalized Tikhonov regularization esti¬ 
mate, which minimizes 


^ EII ^2 - Mzlla 


W 


2 

F 


1 

n 


Ell"'ll2 


+ 


W 


2 

F 


= -^ef E, + Tr (W 

^ i=l 



where we used the reconstruction error 


(28) 


Ei = Vi - W Hi ■ 


(29) 


12 














We obtain with this definition of the error 


- ^ ef 

n 

i=l 




( 30 ) 


^ n - n - n 

= - -- n, 

n n n 


2 = 1 


2=1 


2=1 


(31) 


(32) 


+ - w n, nJ + w i: w'^ 

2=1 

= c- uw^-wu^ + wsw'^. 

Therefore from the fixed point equation for with the diagonal update rule follows 

^ = diag ef + WS , 

where “diag” projects a matrix to a diagonal matrix. From this follows that 

Tr (4') = Tr ( - V ei ef + W 11 W'^] . 

/ 

Consequently, the fixed point W minimizes Tr ('5') given fii and S. 

After convergence of the algorithm S = (J + ^ holds. The Woodbury identity 

(matrix inversion lemma) states 

{WW^ + ^ ^W{I + (33) 

from which follows by multiplying the equation from right and left by ’®' that 

= W {I + W'^ (34) 

= ^ ^ {ww^ + ^ 

Inserting this equation Eq. ( [34l l into Eq. pT] ) gives 

'l " 


’4' = diag - ^ e, ef + 4^ - '4' (FT + ^) ^ 4^ 


(35) 


i 


= 4^ + diag I - ^ e, ef - 4^ (W + 4-) 4- 

Therefore we have 


diag ( - ^ e, ef - 4- (FF W’ 


4') ^4- = 0 . 


(36) 


2 = 1 


It follows that 


Tr ( - ^ e, ef ) = Tr (4- (W FF"^ + 4-) ^4- 
+ 4f)”^) Tr(4')^ . 


2=1 
< Tr( fW 


(37) 

The inequality uses the fact that for positive definite matrices A and B inequality Tr(AS) < 
Tr(A)Tr(B) holds ISl. Thus, for 4- < WW^ the error Tr (i e,ef) = 2 ef e, is 
quadratic in 4'. 


(38) 


Multiplying the fixed point equation Eq. ( [26l l by S gives U = WS. Therefore we have: 

wu'^ = w SW^ = uw'^. 
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Inserting Eq. ( (34| ) into the first line of Eq. ( [3Ql t and Eq. ( |38l ) for simplifying the last line of Eq. ( (3Q| ) 
gives 

1 ^ 

{WW'^ + ^ = C - - W SW'^ . (39) 

Using the trace norm (nuclear norm or Ky-Fan n-norm) on matrices, Eq. ([37]) states that the left hand 
side is quadratic in tF for ^ WW^. The trace norm of a positive semi-definite matrix is its 
trace and bounds the Frobenius norm ||23]| . Furthermore, Eq. ( [36| l states that the left hand side of 
this equation has zero diagonal entries. Therfore it follows that 

c = ^ + W SW^ (40) 

holds except an error, which is quadratic in for <C WW'^. The diagonal is exactly modeled 
according to Eq. ( [3^ . □ 

Therefore the model corresponding to the fixed point explains the empirical matrix of second mo¬ 
ments C by a noise part ’4' and a signal part W. Like factor analysis the data variance is 
explained by the model via the parameters ^ (noise) and W (signal). 


S4.2 Full Noise Covariance Update 


Theorem 6 (REN Correctness: Full Noise Covariance Update). The fixed point W minimizes 
Tr (T^) given fii and S by ridge regression with 


Tr(’®-) = - y] 


W 


where we used the error 

Ei = Vi - W Pi 

The model explains the data covariance matrix by 

c = ^ + w SW^ . 

The reconstruction error 

1 
n 

is quadratic in for ’S' < WW^. 


El 


(41) 

(42) 

(43) 

(44) 


Proof. The first part follows from previous Theorem]^ The fixed point equation for the ’S' update is 


<b = C- UW^-WU^ + WSW^, 

using Eq. ( |38| l this leads to 

c = ^ + wSW^ . 

From Eq. ( [30| l follows for the fixed point of ’S' with the full update rule: 

’S' = - V ei ef + W S . 
n 

Inserting Eq. ( (34| ) into Eq. ( |47) t gives 


’S' = - y]] ei ef -f T' - ’S' (W -f T') ^ T', 


from which follows 


= ’S' (SU -f ’S') ^ ’S'. 


2 = 1 


Thus, the error Tr (4 X]r=i — n ^^=1 quadratic in ’S', for T' <C WW^. 


(45) 

(46) 

(47) 

(48) 

(49) 

□ 
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S5 Maximum Likelihood Factor Analysis 

We are given the data {r;} = {vi ,..., which is assumed to be centered. Centering can be done 
by subtracting the mean fi from the data. The model is 

V = Wh + e , (50) 

where 

h ~ and e ~ A/'(0,^) . (51) 

The model includes the observations v G M™, the noise e G M™, the/actors h G the/actor 
loading matrix W G and the noise covariance matrix \I/ G Typically we assume 

that is a diagonal matrix to explain data covariance by signal and not by noise. The data variance 
is explained through a signal part IVh and through a noise part e. The parameters of the model 
are W and From the model assumption it follows that if h is given, then only the noise e is a 
random variable and we have 

v\h ^ M{Wh,^) . (52) 

We want to derive the likelihood of the data under the model, that is, the likelihood that the model 
has produced the data. Let E denote the expectation of the data including the prior distribution of 
the factors and the noise distribution. We obtain for the first two moments and the variance: 

E(u) = E(TF/i + e) = FFE(fi) + E(e) = 0 , (53) 

E(uu^) = E((TF/i + e){Wh + e)^) = 

WE {h h^) + WE (h) E (e^) 

+ E (e) E W^ + E (e e"^) = 

W W^ + ^ 

var(u) = E{v v^) - (E(u))^ = W W^ + . (54) 

The observations are Gaussian distributed since their distribution is the product of two Gaussian 
densities divided by a normalizing constant. Therefore, the marginal distribution for v is 

u ~ A/" (0 , WW"^ + . (55) 

The log-likelihood log n”=i of d^ta {u} under the model (W, ’®') is 

n n 

log]Jp(u,) = log [|(27r)"’"/"|WW^ + (56) 

exp (vf (WW^ + 

= _ !^ log(27r)- ^ log|WW^ + ^1 

1 ^ 

- 

^ i=l 

where |. | denotes the absolute value of the determinant of a matrix. 

To maximize the likelihood is difficult since a closed form for the maximum does not exists. There¬ 
fore, typically the expectation maximization (EM) algorithm is used to maximize the likelihood. 
For the EM algorithm a variational distribution Q is required which estimates the factors given the 
observations. 

We consider a single data vector Vi. The posterior is also Gaussian with mean {f^p)i and covariance 
matrix 

hi\vi ~ A/'((/Xp)i, Sp) (57) 

(/xp), = W^(W W^ + 

Sp = I - W^ (W W^ -f ’®')"^W, 
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where we used the fact that 


^ua = Cov(tt, a)and 'Eau = Cov{a,u) : 

CL \ XL ^ J\f (^/Xd -\- Edu^uu ("^ ^u) ; ^aa '^au^uu'^ua^ 


(58) 


and 


E{hv) = WE{hhF) = W . 

The EM algorithm sets Q to the posterior distribution for data vector Vi\ 

Qi{hi) = p{hi\ Vi\W,'^) = Af {{fXp)i,Ep) , 


(59) 


(60) 


therefore we obtain for standared EM 


Mi — {l^q)i — (Mp)i 

S = Eq = Sp . 


( 61 ) 

(62) 


The matrix inversion lemma (Woodbury identiy) can be used to compute /x^ and S: 


{WW'^ + <b) ^ . (63) 


Using this identity, the mean and the covariance matrix can be computed as: 


/X, = EE^ {WW'^ + 'E) ^ V, = {I + EE^’E-^EE) ^ (64) 

E = I - W'^ {W + 'E)”^EE = (/ + EE^’E-^EE)"^ . 


The EM algorithm maximizes a lower bound T on the log-likelihood: 

= logp('Ui) - £>kl(<3(^*) II p{hi I Vi)) 


(65) 



Dkl denotes the Kullback-Leibler (KL) divergence ll3^ which is larger than zero. 

is the EM objective which has to be maximized in order to maximize the likelihood. The E- 
step maximizes with respect to the variational distribution Q, therefore the E-step minimizes 
DKhiQih-i) II p{hi I Vi)). After the standard unconstrained E-step, the variational distribution is 
equal to the posterior, i.e. Q{hi) = p{hi \ Vi). Therefore the KL divergence 


DKL{Q{h^) II p{hi I Vi)) = 0 


( 66 ) 


is zero, thus T is equal to the log-likelihood logp(t)i) {F = logp(rti)). The M-step maximizes F 
with respect to the parameters (EE,’E), therefore the M-step maximizes f Q(hi) logp(vi | hi)dhi. 

We next consider again all n samples {rt} = {rti The expected reconstruction error S for 

these n data samples is 



and objective to maximize becomes 



( 68 ) 
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The M-step requires to minimize E: 


77? 1 

£: = -log (2^) + -log I'®'! + 


1 

2 n 


‘IL 

^EQ(('n, - Whif<b~^{v, - Wh, 


Tfi 1 

= -log(27r) + -log I’®'! + 


Eq + hfW'^^-^Wh,) 

2 n 

i—l 

1 1 ^ 

= yog(27r) + -log I’®'! + 

i—l 

- TT(^^-^W^EQ{h,)vf'^ + ^TT(w^^-^Wj2^Q{h,hJ 

1 1 / 1 ^ 

= ylog(27r) + -log I’®'! + -Tr ^ 


Z=1 


/ 1 i 

- Tr y fi. 


\ i=i / 

'^f] + lTr(w^^-^wl E(S + 


2 = 1 


2 = 1 


= - (m log(27r) + logl^'l + Tr (’®' ^C) 

- 2Tr(®'^iTyj7^) + Tr(TE'^’®'-iTES')) , 
where Tr gives the trace of a matrix. 

The derivatives with respect to the parameters are set to zero for the optimal parameters: 

n n 


and 




2 n 


1 

2 n 


■ y Eg (-n, - Wh,) (v, - Tr/r,f’®'”i) = 0. 

2=1 

Solving above equations gives: 




-1 


2=1 


2 = 1 


and 


1 ^ 

= -EEq(K - = 

^ 2 = 1 

^ n 1 ^ 

-yu.y - iyt;,EyM(TE“y - 

n n ^ 

i—l i—l 

n 1 ^ 

Tl. Tl. 

2=1 2=1 


( 69 ) 


( 70 ) 


( 71 ) 


( 72 ) 


( 73 ) 


( 74 ) 


( 75 ) 


( 76 ) 


( 77 ) 
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We obtain the following EM updates: 

E-step: (78) 

/X, = (j + V,, 

s = (/ + , 

Eq {hi) = fii 
Eq {hi hj) = Hr ty + S 

M-step: (79) 

. n 1 ^ 

^new ^ ^ T _ _ ^ (p^.new)T _ 

2=1 2=1 

- n 1 ^ 

-J2w'^^'^EQ{h,)vf + EE-W-^Eq {h, hJ) . 

2=1 2=1 

The EM algorithms can be reformulated as: 

E-step: (81) 

/X, = (I + w'^^-^wy^ V, , 

s = (I + w'^^-^wy^, 

Eq (^ 2 ) = t^i 
Eg (h., hf) = ^J,^^y + s 



(82) 

(83) 

(84) 

(85) 

( 86 ) 


S6 The RFN Objective 

Our goal is to find a sparse, non-negative representation of the input which extracts structure from 
the input. A sparse, non-negative representation is desired to code only events or objects that have 
caused the input. We assume that only few events or objects caused the input, therefore, we aim 
at sparseness. Eurthermore, we do not want to code the degree of absence of events or objects. As 
the vast majority of events and objects is supposed to be absent, to code for their degree of absence 
would introduce a high level of random fluctuations. 

We aim at extracting structures from the input, therefore generative models are use as they explicitly 
model input structures. Eor example factor analysis models the covariance structure of the data. 
However a generative model cannot enforce sparse, non-negative representation of the input. The 
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input representation of a generative model is the posterior’s mean, median, or mode. Generative 
models with rectified priors (zero probability for negative values) lead to rectified posteriors. How¬ 
ever these posteriors do not have sparse means (they must be positive), that is, they do not yield 
sparse codes m. For example, rectified factor analysis, which rectifies Gaussian priors and selects 
models using a variational Bayesian learning procedure, does not yield posteriors with sparse means 
iMiini- A generative model with hidden units h and data v is defined by its prior p{h) and its 
likelihood p{v \ h). The posterior p{h \ v) supplies the input representation of a model by the pos¬ 
terior’s mean, median, or mode. However, the posterior depends on the data v, therefore sparseness 
and non-negativity of its means cannot be guaranteed independent of the data. Problem at coding 
the input by generative models is the data-dependency of the posterior means. 

Therefore we use the posterior regularization method (posterior constraint method) lElIlSlIlol. 
The posterior regularization framework separates model characteristics from data dependent char¬ 
acteristics like the likelihood or posterior constraints. Posterior regularization incorporates data- 
dependent characteristics as constraints on model posteriors given the observed data, which are 
difficult to encode via model parameters by Bayesian priors. 

A generative model with prior p(/r) and likelihood | h) has the full model distribution p(/r, rt) = 
p{v I h)p{h). It can be written as pih, v) = p{h \ v)p(v), where p{h \ v) is the model posterior 
of the hidden variables and p{v) is the evidence, that is, the likelihood of the data to be produced 
by the model. The model family and its parametrization determines which structures are extracted 
from the data. Typically the model parameters enter the likelihood p{v \ h) and are adjusted to the 
observed data. For the posterior regularization method, a family Q of allowed posterior distributions 
is introduced. Q is defined by the expectations of constraint features. In our case the posterior means 
have to be non-negative. Distributions Q G Q are called variational distributions (see later for using 
this term). The full variational distribution is Q{h,v) = Q{h \ v)py{v) with Q{h | rt) G Q. The 
distribution p„ (v) is the unknown distribution of observations as determined by the world or the data 
generation process. This distribution is approximated by samples drawn from the world, namely the 
training samples. p{h, v) contains all model assumptions like the structures used to model the data, 
while Q{h, v) contains all data dependent characteristics including data dependent constraints on 
the posterior. 

The goal is to achieve Q{h,v) = p{h,v), to obtain (1) a desired structure that is extracted from 
the data and (2) desired code properties. However in general it is to achieve this identity, therefore 
we want to minimize the distance between these distributions. We use the Kullback-Leibler (KL) 
divergence ll34ll Z?kl to measure the distance between these distributions. Therefore our objective 
is DKhiQih, v) II p{h, v)). Minimizing this KL divergence (1) extracts the desired structure from 
the data by increasing the likelihood, that is, Pv{v) « p{v), and (2) enforces desired code properties 
by Q{h I v) « p{h \ v). Thus, the code derived from Q{h \ v) has the desired properties and t 
extracts the desired input data structures. 

We now approximate the KL divergence by approximating the expectation over (rt) by the empir¬ 
ical mean of samples {v} = {vi,..., drawn from Pv{v): 


DKhiQih, v) |j p{h, v)) = [ Q{h, v) log dh dv 

J P{h,v) 

= [ Pviv) [ Q{h\v) log ^^j^^dhdv 
Jv Jh p{h,v) 


- y] / Q{h I Vi) log 


Q{h,Vi) 
pih, Vi) 


dh 


I Qih I Vi) log ^ dh + - Y^^^SPvivi) ■ 
^ pih, Vi) n ^ 


IH 


(87) 


The last term ^ Y!h=i ^ogpvivi) neither depends on Q nor on the model, therefore we will neglect 
it. In the following, we often abbreviate Qih \ Vi) by Qihi) or write Qihi \ Vi), since the hidden 
variable is based on the observation Vi. Similarly we often write pihi, Vi) instead of pih, Vi) and 
even more often | Vi) instead of pih \ Vi). 
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We obtain the objective T (to be maximized) of the posterior constraint method |[l2l[39l|40l: 


^ n ^ n 

^ = - y'logp(Vi)-V£>KL(Q(/li) \\pih, I Vi)) 

n n 


( 88 ) 


2=1 


2=1 


= - X! / logp(t;,) dh, -^ / Q{hi) log 


2=1 ' 


2=1 ■ 


Q{hi) 

p{hi I Vi) 


dhi 


= " “ £ / log 


Qjhj) 

p(h„Vi) 


dhi 


nonumber = “ “ X! / log ~ f logP(w I ^0 (^9) 

-| n „ 1 ^ 

= - V / Q(^») logp(w I h.*) dh, - '^DKhiQiK) II pih,)) . 

n I n 

A —1 —1 


The first line is the negative objective of the posterior constraint method while the third line is the 
negative Eq. ( |87] l without the term ^ Y!d=i ^ogPv{vi). 

T is the objective in our framework which has to be maximized. Maximizing (1) increases the 
model likelihood ^ ^ogp{vi), (2) finds a proper input representation by small DKh{Q{hi) || 
p{hi I Vi)). Thus, the data representation (1) extracts structures from the data as imposed by the 
generative model while (2) ensuring desired code properties via Q G Q. 

In the variational framework, Q is the variational distribution and iF is called the negative/ree energy 
ED. This physical term is used since variational methods were introduced for quantum physics by 
Richard Feynman ll42l . The hidden variables can be considered as the hctive causes or explanations 
of environmental fluctuations ll43l . 

If p{h \ v) G Q, then Q{h \ v) = p{h \ v) and we obtain the classical EM algorithm. The EM 
algorithm maximizes the lower bound F on the log-likelihood as seen at the first line of Eq. ( [SS] ) 
and ensures in its E-step Q{h \ v) = p{h \ v). 


S7 Generalized Alternating Minimization 


Instead of the EM algorithm we use the Generalized Alternating Minimization (GAM) algorithm ETIl 
to allow for gradient descent both in the M-step and the E-step. The representation of an input by 
a generative model is the vector of the mean values of the posterior, that is, the most likely hidden 
variables that produced the observed data. We have to modify the E-step to enforce variational 
distributions which lead to sparse codes via zero values of the components of its mean vector. Sparse 
codes, that is, many components of the mean vector are zero, are obtained by enforcing non-negative 
means. This rectihcation is analog to rectified linear units for neural networks, which have enabled 
sparse codes for neural networks. Therefore the variational distributions are restricted to stem from a 
family with non-negative constraints on the means. To impose constraints on the posterior is known 
as the posterior constraint method Uni EH Ho). The posterior constraint method maximizes the 
objective both in the E-step and the M-step. The posterior constraint method is computationally 
infeasible for our approach, since we assume a large number of hidden units. For models with many 
hidden units, the maximization in the E-step would take too much time. The posterior constraint 
method does not support fast implementations on GPUs and stochastic gradients, which we want to 
allow in order to use mini-batches and dropout regularization. 

Therefore we perform only one gradient descent step both in the E-step and in the M-step. Unfor¬ 
tunately, the convergence proofs of the EM algorithm are no longer valid. However we show that 
our algorithm is a generalized alternating minimization (GAM) method. Gunawardana and Byrne 
showed that the GAM converges ETIl (see also EH). 

The following GAM convergence Theorem|7]is Proposition 5 in ETl and proves the convergence of 
the GAM algorithm to a solution that minimizes —F. 
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Theorem 7 (GAM Convergence Theorem). Let the point-to-set map FB the composition B o F o/ 
point-to-set maps F :I?x0—>I?x0 and B:I?x0—>I?x0. Suppose that the point-to-set 
maps F and B are defined so that 

(1) F and B are closed on D' x 0 

(2) F{V' x&)CVxeand B{V' x 0) C x 0 

Suppose also that F is such that all (Q'x, d') € F(Qjf) d) have O' = 9 and satisfy 
(GAM.F): Dkl{Q'x II Pxm) < Dkl{Qx |i Pxm) 

with equality only if 

(EQ.F): Qx = arg min DKhiQx II Px-e) , 

with Qx being the unique minimizer. Suppose also that the point-to-set map B is such that all 
{Q'x: (^') G B((5x, d) have Q'x = Qx ond satisfy 

(GAM.B): Dkl{Qx || Px-e') < D-x'l{Qx || Px-e) 

with equality only if 

(EQ.B): 9 G argmini:)KL(Qx || Px-^) ■ 

Then, 

(1) the point-to-set map FB is closed on F' x & 

(2) FB{V' X 0) C X 0 

and FB satisfies the GAM and EQ conditions of the GAM convergence theorem, that is. Theorem 3 
in 

Proof See Proposition 5 in ETll . □ 

The point-to-set mappings allow extended E-step and M-steps without unique iterates. Therefore, 
Theorem Tjholds for different implementations, different hardware, different precisions of the algo¬ 
rithm under consideration. 

For a GAM method to converge, we have to ensure that the objective increases in both the E-step 
and the M-step. Q is from a constrained family of variational distributions, while the posterior and 
the full distribution (observation and hidden units) are both derived from a model family. The model 
family is a parametrized family. For our models (i) the support of the density models does not depend 
on the parameter and (ii) the density models are continuous in their parameters. GAM convergence 
requires both (i) and (ii). Furthermore, both the E-step and the M-step must have unique maximizers 
and they increase the objective if they are not at a maximum point. 

The learning rules, that is, the E-step and the M-step are closed maps as they are continuous func¬ 
tions. The objective for the E-step is strict convex in all its parameters for the variational distribu¬ 
tions, simultaneously Il35ll3^ . It is quadratic for the mean vectors on which constraints are imposed. 
The objective for the M-step is convex in both parameters W and (we sometimes estimate 
instead of tl/^^). The objective is quadratic in the loading matrix W. For rectifying only, we 
guarantee unique global maximizers by convex and compact sets for both the family of desired dis¬ 
tributions and the set of possible parameters. For this convex optimization problem with one global 
maximum. For rectifying and normalizing, the family of desired distributions is not convex due 
to equality constraints introduced by the normalization. However we can guarantee local unique 
maximizers. 

Summary of the requirements for GAM convergence Theorem]^ 

1. the learning rules, that is, the E-step and the M-step, are closed maps, 

2. the parameter set is compact. 
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3. the family of variational distributions is compact (often described by the feasible set of 
parameters of the variational distributions), 

4. the support of the density models does not depend on the parameter, 

5. the density models are continuous in the parameters, 

6. the E-step has a unique maximizer, 

7. the E-step increases the objective if not at the maximizer, 

8. the M-step has a unique maximizer (not required by Theorem]^, 

9. the M-step increases the objective if not at the maximizer. 

The resulting model from the GAM procedure is at a local maximum of the objective given the model 
family and the family of variational distributions. The solution minimizes the KL-distance between 
the family of full variational distributions and full model family. “Eull” means that both the observed 
and the hidden variables are taken into account, where for the variational distributions the probability 
of the observations is set to 1. The desired family is defined as the set of all probability distributions 
that assign probability one to the observation. In our case the family of variational distributions is not 
the desired family since some distributions are excluded by the constraints. Therefore the solution 
of the GAM optimization does not guarantee stationary points in likelihood IItTI . This means that 
we do not maximize the likelihood but minimize the KL-distance between variational distributions 
and model. 

S8 Gradient-based M-step 

S8.1 Gradient Ascent 

The gradients in the M-step are; 



and 


1 1 

- WK) (u, - . (90) 


Alternatively, we can estimate ^ ^ which leads to the derivatives; 



(91) 


Scaling the gradients leads to; 



(92) 


and 


2 


(93) 
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or 


(94) 


2 V^-i^ = 


/ n ^ n 

\ n n ^ 

\ i—l i—l 

n n \ 

- -E^Eq(/i,) vf + W-EEq {KhJ)W^\ 

i=l i=l / 


Only the sums 

n 

^ = - 

‘ i=l 

and 

1 ” 

S = - T.^Q{h.hf) 

i=l 

must be computed for both gradients. 


1 


C = -^v,vl 

71 


Z=1 


is the estimated covariance matrix (matrix of second moments for zero mean). 

The generalized EM algorithm update rules are: 


(95) 


(96) 


(97) 


C 

E-step: 

S 

Eg {hi) 
Eg {h, hf) 

U 

s 




(98) 


(W W'^ + ^) ^ V, = {I + ^ Vi , 

I - w'^ {ww'^ + = (/ + , 

11, nJ + s 

1 " 

- E^»®q 

n ^ 

i—\ 

n 

- EEg {h,hj) 


M-step: (99) 

AW = U - W S 

A<lf = {C - UW^ - WU + W S Wy . 


S8.2 Newton Update 

Instead of gradient ascent, we now consider a Newton update step. The Newton update for finding 
the roots of is 

OV 

Vn+I = v„ - ri H~^ V„f{v„) , (100) 

where p is a small step size and H is the Hessian of / with respect to v evaluated at u„. We denote 
the update direction by 

Av = - Tf-i V,/(u„) . (101) 
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S8.2.1 Newton Update of the Loading Matrix 


Theorem 8 (Newton Update for Loading Matrix). The M-step objective £ is quadratic in IV, thus 
convex in W. The Newton update direction for W in the M-step is 

AW = U S-^ - W . (102) 

Proof. The M-step objective is the expected reconstruction error £, which is according to Eq. ( |69l l 

1 ^ /* 1 

£ = -V' / Qih,) log(p(ni I hi)) dhi = log (27r) -f log|'5'| (103) 

n 2V 

-f Ti{'V-^C) - 2 Tt{^-^WU'^) + Tr(W^T'-iW5)) , 

where Tr gives the trace of a matrix. This is a quadratic function in W, as stated in the theorem. 

The Hessian Hw of (2£) with respect to VU as a vector is; 

_ dvec (2 \/w£) _ dvec (- U + W S) 

^ c)vec(W)^ c)vec(W)^ 

= S 

where ® is the Kronecker product of matrices. Hw is positive definite, thus the problem is convex 
in W. The inverse of Hw is 

= 5-1 0 . (105) 

For the product of the inverse Hessian with the gradient we have: 

vec (- U + S) = vec (- U + W S) S-^) (106) 

= vec (- U S'-! + W) . 

If we apply a Newton update, then the update direction for W in the M-step is 

AW = U S-^ - W . (107) 

□ 


This is the exact EM update if the step-size 77 is 1. Since the objective is a quadratic function in W, 
one Newton update would lead to the exact solution. 

S8.2.2 Newton Update of the Noise Covariance 

We define the expected approximation error by 

E = C- UW^-WU + WSW^ (108) 

= - (K - Wh,) (n, - Wh,)^^ . 


’®' as parameter. 

Theorem 9 (Newton Update for Noise Covariance). The Newton update direction for 9/ as param¬ 
eter in the M-step is 

A’S' = E; - . (109) 

An update with A’S' (77 = 1) leads to the minimum of the M-step objective £. 

Proof The M-step objective is the expected reconstruction error £, which is according to Eq. ( |69l l 
1 ^ /* 1 

£ = -/ Qih,) log {p{vi I hi)) dhi = -(m log (27r) -f log|'5'| (110) 

-f Tr('5'-iC) - 2Ti {^-^WUy + Tr W5) ) , 
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where Tr gives the trace of a matrix. 

Since 

2V^,£’ = , (111) 

is 


^ = E 


( 112 ) 


the minimum of S with respect to ’4'. Therefore an update with A’4' = — ’4' leads to the minimum. 


The Hessian H-g, of (2£) with respect to 4^ as a vector is: 

c)vec( 2 V^£’) 9vec (4>”^ - ’4'“^£;’4'^^) 

^ 9vec(’4')'^ i9vec(’4')^ 

= - (g) (g (’4'“^£;4f“^) + (g . 


(113) 


The expected approximation error E is a sample estimate for '4', therefore we have ’4' « £J. The 
Hessian may not be positive definite for some values of E, like for small values of E. In order to 
guarantee a positive definite Hessian, more precisely an approximation to it, for minmization, we set 

E = ^ (114) 

and obtain 

g 4^-1 . (115) 

We derive an approximate Newton update that is very close to the Newton update. 

The inverse of the approximated is 

= 4^ g 4- . (116) 


For the product of the inverse Hessian with the gradient we have: 

ff^Nec(4'-i - <lf = vec(4' (4--^ - (117) 

= vec (4/ — FI) . 


If we apply a Newton update, then the update direction for 4/ in the M-step is 

A4' = F; - 4- . (118) 

This is the exact EM update if the step-size p is 1. □ 

4> ^ as parameter. 

Theorem 10 (Newton Update for Inverse Noise Covariance). The M-step objective S is convex in 
4r^^. The Newton update direction for 4/^^ as parameter in the M-step is 

A4'-i = 4'-! - 4'-1 f;4'-i. (119) 

A first order approximation of this Newton direction for 4^ in the M-step is 

A4' = F; - 4- . (120) 

An update with A4' (rj = 1) leads to the minimum of the M-step objective £. 

Proof. The M-step objective is the expected reconstruction error S, which is according to Eq. ( [69l l 
1 ^ /* 1 

£ = -^ / Q{hi) log(p(i!i I hi)) dhi = 7 . log (27r) -f log|4'| (121) 

n 2V 

-f Tr(4'-iC) - 2Tr(4'-iWi7^) -f Tr (W^4'-i W5) ) , 

where Tr gives the trace of a matrix. 
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Since 


2 Vq,^i£ = - ^ + E 


( 122 ) 


is 


^ = E 


(123) 


the minimum of £ with respect to Therefore an update with A^' = E — leads to the 

minimum. 


The Hessian Hq,-i of {2£) with respect to ^ ^ as a vector is: 

(9vec (2 V^-i5) c)vec(—+ E) 


’3> (g) fl>. 


(124) 


Since the Hessian is positive definite, the E-step objective £ is convex in ’S' which is the first 
statement of the theorem. 


The inverse of H^-i is 

= 'I' ^ 0 ’S'-l . 


(125) 


For the product of the inverse Hessian with the gradient we have: 

H-l, vec(- S' + E;) = vec(S'-i (- S' + E)^-^) (126) 

= vec(-S'-i + S'-^ElS'-i) . 


If we apply a Newton update, then the update direction for ’S' ^ in the M-step is 

AS'-i = S'-i - S'-^ElS'-i. (127) 


We now can approximate the update for ’S' by the first terms of the Taylor expansion: 


S' -f AS' = (S'"^ + AS'-^) ^ « S' - S' AS'^^ S' . (128) 

We obtain for the update of S' 

AS' = - S' AS'-^ S' = E; - S' . (129) 

This is the exact EM update if the step-size p is 1. □ 

The Newton update derived from ’S'“^ as parameter is the Newton update for S'. Consequently, the 
Newton direction for both S' and S'~^ is in the M-step 

AS' = E; - S' . (130) 


S9 Gradient-based E-Step 

S9.1 Motivation for Rectifying and Normalization Constraints 

The representation of data vector v by the model is the variational mean vector fig. In order to obtain 
sparse codes we want to have non-negative fig. We enforce non-negative mean values by constraints 
and optimize by projected Newton methods and by gradient projection methods. Non-negative 
constraints correspond to rectifying in the neural network field. Therefore we aim to construct 
sparse codes in analogy to the rectified linear units used for neural networks. 

We constrain the variational distributions to the family of normal distributions with non-negative 
mean components. Consequently we introduce non-negative or rectifying constraints: 

fi > 0 , (131) 

where the inequality “>” holds component-wise. 
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However generative models with many coding units face a problem. They tend to explain away small 
and rare signals by noise. For many coding units, model selection algorithms prefer models with 
coding units which do not have variation and, therefore, are removed from the model. Other coding 
units hardly contribute to explain the observations. The likelihood is larger if small and rare signals 
are explained by noise, than the likelihood if coding units are use to explain such signals. Coding 
units without variance are kept on their default values, where they have maximal contribution to the 
likelihood. If they are used for coding, they deviate from their maximal values for each sample. 
In accumulation these deviations decrease the likelihood more than it is increased by explaining 
small or rare signals. For our RFN models the problem can become severe, since we aim at models 
with up to several tens of thousands of coding units. To avoid the explaining away problem, we 
enforce the selected models to use all their coding units on an equal level. We do that by keeping the 
variation of each noise-free coding unit across the training set at one. Consequently, we introduce a 
normalization constraint for each coding unit 


1 

n 



= 1 . 


(132) 


This constraint means that the noise-free part of each coding unit has variance one across samples. 


We will derive methods to increase the objective in the E-step both for only rectifying constraints 
and for rectifying and normalization constraints. These methods ensure to reduce the objective in 
the E-step to guarantee convergence via the GAM theory. The resulting model from the GAM pro¬ 
cedure is at a local maximum of the objective given the model family and the family of variational 
distributions. The solution minimizes the KL-distance between the family of full variational distribu¬ 
tions and full model family. “Eull” means that both the observed and the hidden variables are taken 
into account. 


S9.2 The Full E-step Objective 

The E-step maximizes F with respect to the variational distribution Q, therefore the E-step min¬ 
imizes the Kullback-Leibler divergence (KL-divergence) lf34l DKL{Q{h) || p{h \ u)). The KL- 
divergence between Q and p is 

Dkl{Q II p) = [ Q{h) log dh . (133) 

J P{h I V) 

Rectifying constraints introduce non-negative constraints. The minimization with respect to Q{hi) 
gives the constraint minimization problem: 

1 "■ 

min - ^ || p{hi \ Vi)) (134) 

Q h, n ^ 

1 — 1 

s.t. Vi : /Xi > 0 , 
where pi is the mean vector of Qfhi). 

Rectifying and normalizing constraints introduce non-negative constraints and equality constraints. 
The minimization with respect to Q{hi) gives the constraint minimization problem: 

1 " 

min - || p{hi \ Vi)) (135) 

Q(h, n-^ 

1 — 1 

s.t. Vi : /Xi > 0 , 

^ 

i=l 

where /Xi is the mean vector of Qfhf). 

Eirst we consider the families from which the model and from which the variatio nal d istributions 
stem. The posterior of the model with Gaussian prior p(/x) is Gaussian (see Section [S5] l: 

p{h I v) - (27r)"5 ISpI"® exp ^ “ Mp)’" ^p ^ “ Mp)^ ■ (136) 
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To be as close as possible to the posterior distribution, we restrict Q to be from a Gaussian family: 


Q{h) ~ (27r) 2 |S,| 2 exp 


s-i {h 



(137) 


For Gaussians, the Kullback-Leibler divergence between Q and p is 

Dkl(Q II P) = (138) 

^ |Tr(S;i S,) + (/xp - (;xp - /X,) - ( - In^l . 

This Kullback-Leibler divergence is convex in the mean vector /x^ and the covariance matrix of 
Q, simultaneously ll35l[3^ . 

We now minimize Eq. ( |138| l with respect to Q. For the moment we do not care about the constraints 
introduced by non-negativity and by normalization. Eq. P38[ ) has a quadratic form in fiq, where Hq 
does not enter, and terms in where fiq does not enter. Therefore we can separately minimize for 
Sq and for /x^. 

For the minimization with respect to Hq, we require 

^Tr(S;iS,) = (139) 

and 

^ln|S,|=S-^. (140) 

For optimality the derivative of the objective DxhiQ |1 p) with respect to must be zero: 

^Z7kl(Q II p) = ^ _ i = 0 . (141) 

This gives 


S — Eg — Sp . 

We often drop the index q since for 1 < x < n all covariance matrices Hq are equal to Sp. 
The mean vector fiq of Q is the solution of the minimization problem: 

min i (/Xp - p.f' Sp ^ (/Xp - /x) 

fJj ^ 

which is equivalent to 

mm i/x^Sp V - ^JSpV- 


(142) 


(143) 


(144) 


The derivative and the Hessian of this objective is: 

d 


^DKhiQ II p) = Ep (/X - /Xp) , 
^Dkl{Q\\p) = Sp-^ 


(145) 

(146) 


S9.3 E-step for Mean with Rectifying Constraints 
S9.3.1 The E-Step Minimization Problem 

Rectifying is realized by non-negative constraints. The mean vector fiq of Q is the solution of the 
minimization problem: 

mm ^ (m - P'pf Sp ^ (/X - /Xp) (147) 

s.t. /X > 0 . 
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This is a convex quadratic minimization problem with non-negativity constraints (convex feasible 
set). 

If A is the Lagrange multiplier for the constraints, then the dual is 

nun ^ A^SpA + /rjA (148) 

s.t. A > 0 . 

The Karush-Kuhn-Tucker conditions require for the optimal solution for each component 

Xj Hj = 0. (149) 

Further the derivative of the Lagrangian with respect to /r gives 

Sp V - - A = 0 (150) 

which can be written as 

— fip — Yip X — 0 . (151) 


This minimization problem cannot be solved directly. Therefore we perform a gradient projection 
or projected Newton step to decrease the objective. 


S9.3.2 The Projection onto the Feasible Set 


To decrease the objective, we perform a gradient projection or a projected Newton step. We will 
base our algorithms on Euclidean least distance projections. If projected onto convex sets, these 
projections do not increase distances. The Euclidean projection onto the feasible set is denoted by 
P, that is, the map that takes fip to its nearest point p (in the L^-norm) in the feasible set. 

For rectifying constraints, the projection P (Euclidean least distance projection) of Pp onto the 
convex feasible set is given by the solution of the convex optimization problem: 


1 T 

mm - {p - Up) in - ptp) 


(152) 


M 2 

s.t. /X > 0 . 

The following Theorem 11 shows that update Eq. ( [T5^ is the projection P defined by optimization 
problem Eq. ( 152| l. 

Theorem 11 (Projection: Rectifying). The solution to optimization problem Eq. ( |152| i, which defines 
the Euclidean least distance projection, is 


Pj = [P(/Xp)]^. 


0 for inp}j ^ 0 
ib'p)j > 0 


(153) 


Proof. For the projection we have the minimization problem: 

1 T 

njin 2 ~ ~ 

s.t. /X > 0 . 

The Lagrangian L with multiplier A > 0 is 

L = ^ (/X - /Xp)”^ (/X - fip) - X^ p. 

The derivative with respect to /x is 

dL 

— - n-Pp-X-o. 


(154) 


(155) 


(156) 


The Karush-Kuhn-Tucker (KKT) conditions require for the optimal solution that for each constraint 

j- 


^3 Tj — 0 . 


(157) 
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If 0 < {np)j then Eq. ( |156| l requires 0 < /r , be cause the Lagrangian Xj is larger than or equal to zero: 
0 < Xj. From the KKT conditions Eq. ( |157| i follows that Xj = 0 and, therefore, 0 < fxj = {np)j. 
If (/ip)j < 0 then 0 < fj,j — {iJ.p)j, because the constraints of the primal problem require 0 < fij. 
From Eq. ( |156| l follows that 0 < A^. From the KKT conditions Eq. ( |157| l follows th at (p.p ) j = 0 
and 0 < Aj = — (/rp)j. If (/ip)j = 0, then Eq. ( |156| ) and the KKT conditions Eq. ( |157( i lead to 

(P'p)j = fJ-j = Xj = 0. 

Therefore the solution of problem Eq. (|152|l is 


J {fJ-p)j for {^ip)j > 0 and Xj = 0 
1 0 for {pp)j < 0 and Xj = — ipp)j 


(158) 


This finishes the proof. 


□ 


S9.4 E-step for Mean with Rectifying and Normalizing Constraints 
S9.4.1 The E-Step Minimization Problem 

If we also consider normalizing constraints, then we have to minimize all KL-divergences simulta¬ 
neously. The normalizing constraints connect the single optimization problems for each sample Vi. 
For the E-step, we obtain the minimization problem: 

1 " 

min - - [p-p)if Sp ^ - {Pp)i) (159) 

i—1 

1 "" 

s.t. V, : pt, > 0 , Vj : ^ ^ = 1 . 

The “ >”-sign is meant component-wise. The I equality constraints lead to non-convex feasible sets. 
The solution to this optimization problem are the means vectors (li of Q{hi). 


Generalized Reduced Gradient. The equality constraints can be solved for one variable which is 
then inserted into the objective. The equality constraint gives for each 


i=2 \ i=2 

These equations can be inserted into the objective and, thereby, we remove the variables fiij. We 
have to ensure that the /iij exist by 

n 

'^hij < n. (161) 

1=2 

These constraints define a convex set feasible set. To solve the each equality constraints for a variable 
and insert it into the objective is called generalized reduced gradient method im. For solving the 
reduced problem, we can use methods for constraint optimization were we now ensure a convex 
feasible set. These methods solve the original problem Eq. ( |159| l. We only require an improvement 
of the objective with a feasible value. For the reduced problem, we perform one step of a gradient 
projection method. 

Gradient Projection Methods. Also for the original problem Eq. ( |159| l, gradient projection meth¬ 
ods can be used. The gradient projection method has been generalized by Rosen to non-linear con¬ 
straints ca and was later improved by mi. The gradient projection algorithm of Rosen works for 
non-convex feasible sets. The idea is to linearize the nonlinear constraints and solve the problem. 
Subsequently a restoration move brings the solution back to the constraint boundaries. 


S9.4.2 The Projection onto the Feasible Set 

To decrease the objective, we perform a gradient projection, a projected Newton step, or a step of the 
generalized reduced method. We will base our algorithms on Euclidean least distance projections. 
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If projected onto convex sets, these projections do not increase distances. The Euclidean projection 
onto the feasible set is denoted by P, that is, the map that simultaneously takes {(/Xp)i} to the nearest 
points (in the L^-norm) in the feasible set. 


For rectifying and normalizing constraints the projection (Euclidean least distance projection) of 
{(/i,p)i} onto the non-convex feasible set leads to the optimization problem 


min 

Mi 


1 

n 


2=1 






(162) 


s.t. Vi : /Ti > 0 , 



Z = 1 


By using (/r* - = /xf/ri-2/rf (/Xp)i + (/Xp)f (/Xp)i, we see that the objective 

contains the sum l^ij- The constraints enforce this sum to be constant. Therefore inserting the 
equality constraints into the objective, optimization problem Eq. (|162|l is equivalent to 


min 

Mi 


1 " 

~ 'y ' Mi 


(163) 


s.t. Vi : /Xi > 0 , 



i=l 


The following Theorem [T2] shows that updates Eq. ( |164| i and Eq. ( |165| l form the projection defined 
by optimization problem Eq. ( |162| l. 

Theorem 12 (Projection: Rectifying and Normalizing). If at least one {np)ij is positive for 1 < j < 
I, then the solution to optimization problem Eq. ( |162| l, which defines the Euclidean least distance 
projection, is 


f 0 for (pp)y < 0 


(164) 


Pif “ P((Mp)i)]j 



If all {P'p)ij are non-positive for 1 < j < I, then the optimization problem Eq. ( fT6^ has the solution 


( for j = argmax 3 .{(pp)-.} 

0 otherwise 


(165) 


Proof. In the following we show that updates Eq. ( |164| i and Eq. ( |164| i are the projection onto the 
feasible set. For the projection of {{pLp)i] onto the feasible set, we have the minimization problem: 


min 

Mi 


1 

n 


II (Mi 

2=1 


(Mp)i) (Mi 


{P’p)i) 


(166) 


s.t. Vi : /Ti > 0 , 


i=l 

The feasible set is non-convex because of the quadratic equality constraint. The Lagrangian with 
multiplier A > 0 is 

^ n n 

T = — E (Mi ~ (Mp)i) (Mi ~ iP'p)i) ~ E "^i (167) 

+ Eii (i pi, - 1) ^ 
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The Kamsh-Kuhn-Tucker (KKT) conditions require for the optimal solution: 


\ij ^ij = 0 and 



i=l 



= 0 . 


( 168 ) 


- Tj /iy = 0 . (169) 

n 

We multiply this equation by and obtain: 

2 2 

^ if^ij ~ f^ij) ~ t^ij + ^ T? ~ ® ' (1^0) 

The KKT conditions give Xij^ij = 0, therefore this term can be removed from the equation. Next 
we sum over i: 


The derivative of L with respect to is 


dL 

5/ry 


— ilJ-ij {lJ-p)ij) Xij + 


2 '£• 2 

i—1 i—1 

Using the equality constraint 1 /n J2'i-i l^ij — 1 dividing by 2 and gives: 


1 - ^ ^ • 


i=l 


Solving for tj leads to: 


— '^^[l^p)ij ^'‘ij 1 ■ 


(171) 


(172) 


(173) 


We insert tj into Eq. (|169[) 


{f^p)ij 





s=l 


^ij — 0 ■ 


(174) 


We immediately see, that if = 0 then {iJ.p)ij = — f A^- < 0. Therefore we can assume /i^ > 0. 

Multiplying Eq. (|174|l with and using the KKT conditions gives 





fJ^sj 


2 


= 0 . 


(175) 


Therefore and A X)r=i(A*p)siMsj have the same sign or = 0. Since 0 < fiij, we de¬ 

duce that {np)ij and A {x^p)sj^J‘sj have the same sign or /i^ = 0. Since the sum is independent 
of i, all {np)ij with > 0 have the same sign for 1 < i < n. Solving Eq. (|174|l for ^ij gives 


Mij — 


(Mp) 


>pj%j 


n \ 
2 


e:=i(a^p)., ^^sj 


(176) 


I. If all are non-positive for 1 < J < (, then the sum A Y^^=l{^^p)sj^J■sj is negative. Erom the 

first order derivative of the Lagrangian in Eq. (|169|l, we can compute the second order derivative 


02 L 

dfXijdutj 



n 

2 ^^{f^p)ij f^ij ^ 0 ■ 
i=l 


(177) 


We inserted the expression of Eq. ( |173| l for Tj. Since all mixed second order derivatives are zero, 
the (projected) Hessian of the Lagrangian is diagonal with negative entries. Therefore it is strict 
negative definite. Thus, the second order necessary conditions cannot be fulfilled. The minimum is 
a border point of the constraints. 
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For each j for which all {fJ.p)ij are non-positive for l< j < I, optimization problem Eq. ( |163| ) 
defines a plane that has a normal vector in the positive orthant (hyperoctant). For such a j the corre¬ 
sponding equality constraint defines a hypersphere. Minimization means that the plane containing 
the solution is parallel to the original plane and should be as close to the origin as possible. If we 
move the plane parallel from the origin into the positive orthant, then the first intersection with the 
hypersphere is 


f for j = argmax.{(^p)^~.} 

[ 0 otherwise 


(178) 


This is the solution for with 1 < j < ? to our minimization problem. 

II. If one is positive, then from Eq. ( |174| ) with this {np)ij follows that ^ ^^=l{^J'p)sj^^sj is 

positive, otherwise Eq. ( |174| i has only negative terms on the left hand side. In particular, the second 
order necessary conditions are always fulfilled as Eq. P77| i is positive. For (^p)y < 0 it follows 
from Eq. (|174| i that Ay > 0 and from the KKT conditions that /iy = 0. For {np)ij > 0 it follows 
from Eq. (|174| that /iy > 0 and from the KKT conditions that Ay = 0. Therefore we define; 


. ^ f 0 for (^p)y < 0 

1 for {fJ-p)tj > 0 ’ 


We write the solution as 


— 1 




J2s=li^^p)sj fJ-sj 


Ij Hij 


We now use the equality constraint; 

1 


E4 = = 1- 


i=l 


i=l 


Solving for aj gives; 


Therefore the solution is 


1 



(179) 


(180) 


(181) 


(182) 


(183) 


This finishes the proof. 


□ 


S9.5 Gradient and Scaled Gradient Projection and Projected Newton 
S9.5.1 Gradient Projection Algorithm 

The projected gradient descent or gradient projection algorithm iHiia performs first a gradient 
step and then projects the result to the feasible set. The projection onto the feasible set is denoted 
by P, that is, the map that takes /x into the nearest point (in the L^-norm) in the feasible set to fi. 
The feasible set must be convex, however later we will introduce gradient projection methods for 
non-convex feasible sets. 

The gradient projection method is in our case 

Pk+i = P {pk + ASp^(/Xp - pk)) ■ (184) 

The Lipschitz constant for the gradient is ||S~^||a = emax(Sp ^), the largest eigenvalue of Sp 
The following statement is Theorem 5.4.5 in llTsl . 


33 












Theorem 13 (Theorem 5.4.5 in lITSl '). The sufficient decrease condition 
DKhiQifj’k+i) Up) - DKhiQif^k) Up) < 

(e.g. with a = 10““^) holds for all A such that 


0 < A < 


2(1 

^max 


-Q) 
(Sp') ■ 


Mfc+if 


(185) 


(186) 


Proof. See ifTSl . 


□ 


Theorem 13 guarantees that we can increase the objective by gradient projection in the E-step, 
except the case where we already reached the maximum. 


For a fast upper bound on the maximal eigenvalue we use 

e„,ax(S;') < Tr(S;i) (187) 

and 

emax(S;') < IlFFII^ Il'S'-'IU -1, (188) 

where the latter follows from 

Sp 1 = J + . (189) 

Improved methods for finding an appropriate A by line search methods have been proposed Il45ll4^ . 
We use a search with A = /3* with f = 0,1, 2,... and /3 = 2~^ or /? = 10“^. 

A special version of the gradient projection method is the generalized reduced method El- This 
method is able to solve our optimization problem with equality constraints. The gradient projec¬ 
tion method has been generalized by Rosen to non-linear constraints M- The gradient projection 
algorithm of Rosen can also be used for a region which is not convex. The idea is to linearize the 
nonlinear constraints and solve the problem. Subsequently a restoration move brings the solution 
back to the constraint boundaries. Rosen’s gradient projection method was improved by ESI. These 
methods guarantee that we can increase the objective in the E-step for non-convex feasible sets, 
except the case where we already reached the maximum. These algorithms for non-convex feasible 
sets will only give a local maximum. Also the GAM algorithm will only find a local maximum. 


S9.5.2 Scaled Gradient Projection and Projected Newton Method 

Both the scaled gradient projection algorithm and the projected Newton method were proposed in 
E6l . We follow El. 

The idea is to use a Newton update instead of the a gradient update; 

Mfc+i = P (/Xfc -f A - pu)) ■ (190) 

can be an arbitrary strict positive definite matrix. If we set ff ^ = Sp, then we have a Newton 
update of the projected Newton method Qa. For A = 1 we obtain 

ptfe+i = P(ptp) . (191) 

otherwise 

Pk+i = P((l - X)Pk + A/Xp) . (192) 

The search direction for the unconstrained problem can be rotated by H~^ to be orthogonal to the 
direction of decrease in the inactive directions for the constrained problem. 

To escape this possible problem, an e-active set is introduced which contains all j with pj < e. All 
columns and rows of the Hessian having an index in the e-active set are fixed to ej. After sorting 
the indices of the e-active set together, they form a block which is the sub-identity matrix. H is set 
to the Hessian Sp where the e-active set columns and rows are replaced by unit vectors. 

The following Theorem[l4]is Lemma 5.5.1 in ca. Theorem^^states that the objective decreases 
using the reduced Hessian in the projected Newton method for convex feasible sets. 
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Theorem 14 (Lemma 5.5.1 in ifTSl l. The sufficient decrease condition 

DKhiQifJ-k+i) \\p) - DKhiQiPk) Up) < -aifik - /Xp)^Sp^(/Xfc - fj-k+i) (193) 
holds for all A smaller than a bound depending on H and e. 

Proof. See ifTSl . □ 

In practical applications, a proper A is found by line search. The projected Newton method uses 
A = 1 to set e IIT^ : 

e = liMfc - P(Mp)|| . (194) 

S9.5.3 Combined Method 

Following BtI l46l we use the following very general update rule, which includes the gradient pro¬ 
jection algorithm, the scaled gradient projection algorithm, and the projected Newton method. 

We use following update for the E-step: 

dk+i = P (/ifc A- Aff“^Sp^(/Xp - pk)) , (195) 

Pk+i = P(/Xfc + l{dk+i - pk)) ■ 

We have to project twice since the equality constraint produces a manifold in the parameter space. 
We iterate this update until we see a decrease of the objective in the E-step: 



DKhiQk+i p) — D]<.i,{Qk p) < 0 . 

(196) 

Eor the constraints we have only to optimize the mean vector p to ensure 


DKhiQilJ-k+l) \\ P) — DKhiQiP-k) \\ p) < 0. 

(197) 

Even 

DKL{Q{P-k+l) p) = DKhiQilJ-k) p) 

(198) 

can be sufficient if minimizing = Sp ensures 



DKhiQk+l p) < £*Kl(Q/c p) ■ 

(199) 

We use following schedule: 



1. • if-i = Sp 

• A = 1 



• 7 = 1 

That is 

Pk+l = P (Mp) ■ 

(200) 

2. • if-i = Sp 

• A = 1 

• 7 G (0,1] 

That is 

Pk+i = P((l - -i) Pk + 7P(Mp)) • 

(201) 

3. • if-i = Sp 

• A e (0,1] 

• 7 = 1 

That is 

PkJf-i = P ((1 A)/Xfc -j- A/ip) . 

(202) 


4. • if-i = Sp 
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• A G (0,1] 

• 7 =G (0,1] 
That is 


/Xfe+i = P((l - 7 )/Xfc + 7P((1 - + Xfip)) . (203) 

5. • if-i = R(Sp) 

• A G (0,1] 

• 7 =G (0,1] 

R(Sp) denotes the reduced matrix (Hessian or a positive definite) according to the pro¬ 
jected Newton method or the scaled gradient projection algorithm. For convex feasible sets 
we can guarantee at this level already an increase of the objective at the E-step. 

6. • if-i = I 

• A G (0,1] 

• 7 =G (0,1] 

This is the gradient projection algorithm. In particular we include the generalized reduced 
method and Rosen’s gradient projection method. At this step we guarantee an increase of 
the objective at the E-step even for non-convex feasible sets because we also use complex 
methods for constraint optimization. 

Step 5. ensures an improvement if only using rectifying constraints according to the theory of 
projected Newton methods ca. Step 6. ensures an improvement if using both rectifying constraints 
and normalizing constraints, because we use known methods for constraint optimization. To set 
Hk+i = Mfc is sufficient to increase the objective at the E-step if = Sp decreases the KL 

divergence. However we will not always set fik+i = fJ-k to avoid accumulation points outside the 
solution set. 


SIO Alternative Gaussian Prior 

We assume h is Gaussian with covariance M and mean ^ 

/r ~ A((4,M) . (204) 


We derive the posterior for this prior. 

The likelihood is Gaussian since a affine transformation of a Gaussian random variable is again 


a Gaussian random variable and the convolution of two Gaussians is Gaussian, too. Thus, v = 
Wh -f e is Gaussian if h and e are both Gaussian. For the prior moments we have 

E(/r) = 4 , (205) 

E{hh'^) = M -f ^ , (206) 

var(fi.) = M (207) 

and for the likelihood of v we obtain the moments 

E{v) = (208) 

E{vv^) = W E(fi,/i^) ^ (209) 

= W MW^ + ^ + FF ^ , 

var(t;) = W M W'^ + . (210) 

We need some algebraic identities to derive the posterior. The Woodbury matrix identity gives 

M - MW^ {W MW^ + = (M-^ -I- . (211) 
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( 212 ) 


Multiplying this equation from the left hand side with \I/ ^ IV gives 

^ W (M-i + 

= M - {W MW^ + 'ify^WM 

= 'if-yWMW^ + ’®') (WMW^ + WM- 
^-^WMW^ (WMW^ + ^y^WM 
= {'if^yw MW^ + ’4') - ^-^WMWy (W MW^ + ^y^WM 


= {WMW'^ + ’4') ^WM . 

It follows that 

MW^ {WW^ + a = (M-i + W^'if-^Wy^ a . (213) 

The posterior p{h \ v) is derived from Gaussian conditioning because both the likelihood p{v) 
and the prior p{h) are Gaussian distributed. The conditional distribution p{a \ b) of two random 
variables a and b that both follow a Gaussian distribution is a Gaussian: 

a ~ Af {fla,'^aa) , (214) 

b ^ M {fib,Sbb) , (215) 

Sba = Cov(b, a) , (216) 

Sab = Cov(a, b), (217) 

a I 6 ~ A/” (/To + 'Sab^bb y ~ Vb) 7 ^aa ~ '^ab^bb^ba) ■ (218) 

Therefore we need the second moments between v and h: 

E{vh^) = E{Whh^) + E{eh^) = W {M + ^ f) . (219) 

The covariances between v and h are 

Coy{v,h) = E{vhF) — E(u)E(b,^) (220) 

= W M + y - y = W M , 

Cov{h,v) = E{hv^) - E{h)E{v^)= M W'^ . (221) 

Thus, the mean of p{h \ v) is 

tih\v = ^ + MW^ {W MW'^ + - Wy (222) 

= I + (M-^ + w^^-^wy^ {v + W^) 

= (M-i + w^^-^wy^ (M-1 + 

+ (M-i + w'^^-^wy^ {v - wy 

= (M-i + w^^-^wy^ 


{M-y + V - 

= (M-i + w^^-^wy^ V + M-y). 

The covariance matrix of p{h \ v) is 

S^l^ = M - MW^ (W MW'^ + <lfy^W M (223) 

= (M-i + w^^-^wy^. 

In particular, the variable ^ may be used to enforce more sparseness by setting its components to 
negative values. Since the covariance matrix S^|„ is positive semi-definite, we ensure that 

y (M-^ + w^^-^wy^ I > 0. (224) 

If 4 = —pi (1 is the vector with all components being one), then the largest absolute components 
of must be negative. Thus, ^ = —pi leads to sparser solutions. 
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Sll Hyperparameters Selected for Method Assessment 


The performance of rectified factor networks (RFNs) as unsupervised methods for data representa¬ 
tion was compared with: 

(1) RFN: rectified factor networks, 

(2) RFNn: RFNs without normalization, 

(3) DAE: denoising autoencoders with rectified linear units, 

(4) RBM: restricted Boltzmann machines with Gaussian visible units and hidden binary units, 

(5) FAsp: factor analysis with Jeffrey’s prior {p{z) oc I/ 2 ;) on the hidden units which is sparser than 
a Laplace prior, 

(6) FAlap: factor analysis with Laplace prior on the hidden units, 

(7) ICA: independent component analysis by FastICA ll24l . 

(8) SFA: sparse factor analysis with a Laplace prior on the parameters, 

(9) FA: standard factor analysis, 

(10) PCA: principal component analysis. 

The number of components are fixed to 50, 100, or 150 for each method. The used hyperparameters 
are listed in Tab. IS3I 

Table S3: Hyperparameters of all methods that were used to assess the performance of rectified 
factor networks (RFNs) as unsupervised methods for data representation. 


Method 

Used hyperparameters 

RFN 

{learning rate=0.1, iterations=1000} 

RFNn 

{learning rate=0.1, iterations=1000} 

DAE 

{corruption level=0.2, learning rate=le-04, iterations=1000} 

RBM 

{learning rate=0.01, iterations=1000} 

FAsp 

{iterations=500} 

FAlap 

{iterations=500} 

SFA 

{Laplace weight decay factor=5e-05, iterations=500} 
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S12 Data Set I 


The number of components are fixed to 50, 100 or 150. 

We generated nine different benchmark data sets (D1 to D9), where each data set consists of 100 
instances for averaging the results. Each instance consists of 100 samples and 100 features resulting 
in a 100x100 data matrix. Into these data matrices, structures are implanted as biclusters 0. A 
bicluster is a pattern consisting of a particular number of features which is found in a particular 
number of samples. The size of the bicluster is given by the number of features that form the pattern 
and by the number of samples in which the pattern is found. The data sets had different noise levels 
and different bicluster sizes. We considered large and small bicluster sizes, where large biclusters 
have 20-30 samples and 20-30 features, while small biclusters have 3-8 samples and 3-8 features. 
The signal strength (scaling factor) of a pattern in a sample was randomly chosen according to the 
Gaussian Af (1,1). Finally, to each data matrix background noise was added, where the noise is 
distributed according to a zero-mean Gaussian with standard deviation 1, 5, or 10. The data sets are 
described in Tab. The remaining components of the spanning outer product vectors were drawn 
by Af (0,0.01). 

Table S4; Overview over the datasets. Shown is the background noise (“noise”), the number of large 
biclusters (ni), and the number of small biclusters ( 712 ). 



D1 

D2 

D3 

D4 

D5 

D6 

D7 

D8 

D9 

noise 

1 

5 

10 

1 

5 

10 

1 

5 

10 

ni 

10 

10 

10 

15 

15 

15 

5 

5 

5 

712 

10 

10 

10 

5 

5 

5 

15 

15 

15 
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Table S5: Comparison for 50 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse factor 
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Table S6: Comparison for 100 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse factor 
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Table S7: Comparison for 150 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse factor 
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S13 Data Set II 


This data sets was generate as described in Section [ST^ but instead of drawing the remaining compo¬ 
nents of the spanning outer product vectors fromA^ (0, 0.01), they were now drawn from Af (0, 0.5). 
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Table S 8 : Comparison for 50 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse factor 
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Table S9: Comparison for 100 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse factor 
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Table SIO: Comparison for 150 factors / hidden units extracted by RFN, RFN without normalization (RFNn), denoising autoencoder (DAE), restricted Boltzmann 
machines (RBM), factor analysis with a very sparse prior (FAsp), factor analysis with a Laplace prior (FAlap), independent component analysis (ICA), sparse 
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PC A l±o 0±o — 4±o 0±o — 2±o 0±o — l±o 0±o — 2±o 0±o 





S14 RFN Pretraining for Convolution Nets 


We assess the performance of RFN first layer pretraining on CIFAR-10 and CIFAR-100 for three 
deep convolutional network architectures: (i) the AlexNet 13^ . (ii) Deeply Supervised Networks 
(DSN) 1331, and (iii) our 5-Convolution-Network-In-Network (5C-NIN). 

Both CIFAR datasets contain 60k 32x32 RGB-color images, which were divided into 50k train 
and 10k test sets, split between 10 (CIFARIO) and 100 (CIFARIOO) categories. Both datasets are 
preprocessed by global contrast normalization and ZCA whitening ll48ll . Additionally, the datasets 
were augmented by padding the images with four zero pixels at all borders. For data augmentation, 
at the beginning of every epoch, images in the training set were distorted by random translation and 
random flipping in horizontal and vertical directions. For the AlexNet, we neither preprocessed nor 
augmented the datasets. 

Inspired by the Network In Network approach B9l . we constructed a 5-Convolution-Network-In- 
Network (5C-NIN) architecture with hve convolutional layers, each followed by a 2x2 max-pooling 
layer (stride 1) and a multilayer perceptron (MLP) convolutional layer. ReLUs were used for the 
convolutional layers and dropout for regularization. For weight initialization, learning rates, and 
learning policies we used same strategy as in the AlexNet ll28l . The networks were trained using 
mini-batches of size 100 and 128 for 5C-NIN and AlexNet, respectively. 

For RFN pretraining, we randomly extracted 5x5 patches from the training data to construct 192 
biters for DSN and 5C-NIN while 32 for AlexNet. These biters constitute the brst convolutional 
layer of each network which is then trained using default setting. For assessing the improvement 
by RFNs, we repeated training with randomly initialized weights in the brst layer. The results are 
presented in Tab. |S1 1| For comparison, the lower panel of the table reports the performance of the 
currently top performing networks: Network In Network (NIN, 1491 ), Maxout Networks (MN, BSl l 
and DeepCNiN 150). In all cases pretraining with RFNs decreases the test error rate. 

Table S11: The upper panel shows results of convolutional deep networks with brst layer pretrained 
by RFN (“RFN”) and with brst layer randomly initialized (“org”). The brst column gives the net¬ 
work architecture, namely, AlexNet, Deeply Supervised Networks (DSN), and our 5-Convolution- 
Network-In-Network (5C-NIN). The test error rates are reported (for CIFAR-100 DSN model was 
missing). Currently best performing networks Network In Network (NIN), Maxout Networks (MN), 
and DeepCNiN are reported in the lower panel. In all cases pretraining with RFNs decreased the 
test error rate. 


Dataset 

CIFAR-10 

CIFAR-100 



org 

RFN 

org RFN 

augmented 

AlexNet 

18.21 

18.04 46.18 45.80 


DSN 

7.97 

7.74 

34.57 - 

V 

5C-NIN 

7.81 

7.63 

29.96 29.75 

V 

NIN 

8.81 

- 

35.68 - 

V 

MN 

9.38 

- 

38.57 - 

V 

DeepCNiN 6.28 

- 

24.30 - 

V 
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